aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
163 stars 27 forks source link

Empty region_to_gene_importances and region_to_gene_correlation when run_scenicplus #325

Closed YaoLi3 closed 3 months ago

YaoLi3 commented 3 months ago

Discussed in https://github.com/aertslab/scenicplus/discussions/311

Originally posted by **YaoLi3** February 29, 2024 Dear team, I found out that I kept getting empty dictionaries for `region_to_gene_importances` and `region_to_gene_correlation` in function `calculate_regions_to_genes_relationships`. This is how my scplus_obj looks so far: ``` >>> scplus_obj SCENIC+ object with n_cells x n_genes = 380 x 27301 and n_cells x n_regions = 380 x 13303 metadata_regions:'Chromosome', 'Start', 'End', 'Width', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_nr_acc', 'cisTopic_log_nr_acc' metadata_genes:'gene', 'HumanSymbol', 'ZebrafishSymbol', 'MedakaSymbol' metadata_cell:'celltype' menr:'CTX_topics_otsu_All', 'CTX_topics_otsu_No_promoters', 'DEM_topics_otsu_All', 'DEM_topics_otsu_No_promoters', 'CTX_DARs_region_bin_topics_top3k_All', 'CTX_DARs_region_bin_topics_top3k_No_promoters', 'DEM_DARs_region_bin_topics_top3k_All', 'DEM_DARs_region_bin_topics_top3k_No_promoters', 'CTX_DARs_markers_dict_All', 'CTX_DARs_markers_dict_No_promoters', 'DEM_DARs_markers_dict_All', 'DEM_DARs_markers_dict_No_promoters' ``` uns keys I managed to obtain: ``` >>> scplus_obj.uns.keys() dict_keys(['search_space', 'Cistromes']) >>> print(list(signatures.keys())[0:5]) ['EPAS1_(2r)__CTX_topics_otsu_All__Topic1', 'GATA3_extended_(2r)__CTX_topics_otsu_All__Topic1', 'CEBPA_extended_(5r)__CTX_topics_otsu_All__Topic1', 'AHRR_extended_(2r)__CTX_topics_otsu_All__Topic1', 'EPAS1_extended_(2r)__CTX_topics_otsu_All__Topic1'] >>> scplus_obj.metadata_regions['Chromosome'] NC_018546.1:5882-6382 NC_018546.1 NC_018546.1:13199-13699 NC_018546.1 NC_018546.1:8630-9130 NC_018546.1 NC_018546.1:16189-16689 NC_018546.1 NC_018546.1:2799-3299 NC_018546.1 ... NW_023419892.1:1214-1714 NW_023419892.1 NW_023420775.1:1304-1804 NW_023420775.1 NW_023422538.1:881-1381 NW_023422538.1 NW_023423559.1:590-1090 NW_023423559.1 NW_023424523.1:95-595 NW_023424523.1 Name: Chromosome, Length: 13303, dtype: object ``` search space and Cistromes: ``` >>> scplus_obj.uns['search_space'] Name Gene Distance 0 NC_018546.1:413-913 unassigned_transcript_779 [0] 1 NC_018546.1:413-913 unassigned_transcript_780 [348] 2 NC_018546.1:1486-1986 unassigned_transcript_781 [0] 3 NC_018546.1:2027-2527 unassigned_transcript_781 [0] 4 NC_018546.1:923-1423 unassigned_transcript_781 [0] ... ... ... ... 2110 NW_023418709.1:3192-3692 XR_004947600.1 [1662] 2111 NW_023418725.1:3081-3581 XM_036211287.1 [0] 2112 NW_023419487.1:1261-1761 XM_024261841.2 [58] 2113 NW_023419487.1:154-654 XM_024261841.2 [1165] 2114 NW_023423559.1:590-1090 XM_024261831.1 [0] [2115 rows x 3 columns] >>> scplus_obj.uns['Cistromes'] {'Unfiltered': {'ASCL1_(14r)': +----------------+-----------+-----------+ | Chromosome | Start | End | | (category) | (int64) | (int64) | |----------------+-----------+-----------| | NC_050516.1 | 15666009 | 15666509 | | NC_050522.1 | 24706071 | 24706571 | | NC_050524.1 | 33821116 | 33821616 | | NC_050526.1 | 15299260 | 15299760 | | ... | ... | ... | | NW_023416925.1 | 96474 | 96974 | | NW_023417048.1 | 48904 | 49404 | | NW_023417060.1 | 17959 | 18459 | | NW_023417405.1 | 72 | 572 | +----------------+-----------+-----------+ Unstranded PyRanges object has 14 rows and 3 columns from 14 chromosomes. For printing, the PyRanges was sorted on Chromosome. >>> len(scplus_obj.uns['Cistromes']['Unfiltered']) 59 ``` Before `calculate_regions_to_genes_relationships`: At first I was running `get_search_space` on all default parameters, and end up getting negative Start positions (which doesn't make sense) and negative Distances. (By the way, it seems like users will not be warned when negative numbers occur and no code are handling this situation. Indeed, users should pass in upstream/downstream range according to their target species genome size, but lots of people like to run software on default first, and in some rare cases, this could happen) image ``` ---------------------search_space--------------------- Name Gene Distance 0 NC_018546.1:413-913 unassigned_transcript_778 [-594] 1 NC_018546.1:923-1423 unassigned_transcript_778 [-1104] 2 NC_018546.1:413-913 unassigned_transcript_779 [0] 3 NC_018546.1:923-1423 unassigned_transcript_779 [-163] 4 NC_018546.1:1486-1986 unassigned_transcript_780 [-654] ... ... ... ... 3329 NW_023419278.1:838-1338 XM_024264679.1 [-361] 3330 NW_023419487.1:1261-1761 XM_024261841.2 [58] 3331 NW_023419487.1:154-654 XM_024261841.2 [1165] 3332 NW_023423559.1:590-1090 XM_024261831.1 [0] 3333 NW_023424523.1:95-595 XR_002917844.2 [-148] [3334 rows x 3 columns] ``` What I have done: 1. set use_gene_boundaries to `True`; 2. set upstream and downstream to very small numbers. Still getting negative Start_b at this point; 3. turn negative Start_b into 0; 4. filter out regions_per_gene that has negative Distance; Still have all empty `region_to_gene_importances` and `region_to_gene_correlation`. Any help will be much appreciated! Thanks in advance, Yao