DAR calculation results in empty dictionary values for some keys

SidG13 commented 1 year ago

Hi, this isn't an issue with the program itself but I'm just trying to understand better ways to move forward with my data.

I'm running SCENIC+ on my own multiome data and it looks fine until I calculate DARs. The imputed sparsity is very low which may be one issue (running the default command): Imputed accessibility sparsity: 0.00243018816125784

And some of my cell clusters (clusters 0 and 1) are empty in the dictionary value:

>>>print(markers_dict)

{'0': Empty DataFrame
 Columns: [Log2FC, Adjusted_pval, Contrast]
 Index: [],
 '1': Empty DataFrame
 Columns: [Log2FC, Adjusted_pval, Contrast]
 Index: [],
 '2':                            Log2FC Adjusted_pval Contrast
 mi7:4818657-4819157      0.769073      0.000001        2
 ma6:66568746-66569246    0.768514      0.000001        2
 ma6:29137929-29138429    0.766684      0.000001        2
 ma1:278987328-278987828  0.766288      0.000001        2
 ma5:3335808-3336308      0.765712      0.000001        2
 ...                           ...           ...      ...
 mi4:7649213-7649713      0.586276      0.000001        2
 mi1:17598053-17598553    0.586209      0.000001        2
 mi1:16983098-16983598    0.586181      0.000001        2
 ma3:152108281-152108781  0.586084      0.000001        2
 ma1:147379789-147380289  0.585201      0.000001        2

 [2419 rows x 3 columns],
 '3':                            Log2FC Adjusted_pval Contrast
 mi7:4818657-4819157      1.839034           0.0        3
 ma6:29137929-29138429    1.834527           0.0        3
 ma6:66568746-66569246    1.831466           0.0        3
 Z:99232385-99232885      1.829914           0.0        3
 ma5:3335808-3336308      1.824418           0.0        3
 ...                           ...           ...      ...
 mi1:18492676-18493176    0.586122           0.0        3
 ma1:139951997-139952497   0.58608           0.0        3
 ma2:83561973-83562473    0.585666           0.0        3
 ma4:33619315-33619815    0.585487           0.0        3
 ma1:227069338-227069838  0.585128           0.0        3

 [10916 rows x 3 columns],
 '4':                            Log2FC Adjusted_pval Contrast
 ma6:66568746-66569246     0.74331           0.0        4
 mi7:4818657-4819157      0.742966           0.0        4
 ma6:29137929-29138429    0.741914           0.0        4
 ma5:3335808-3336308      0.740059           0.0        4
 Z:99232385-99232885      0.738471           0.0        4
 ...                           ...           ...      ...
 ma2:127339096-127339596  0.585864           0.0        4
 ma6:52456453-52456953    0.585728           0.0        4
 mi3:16386390-16386890    0.585616           0.0        4
 ma7:35807057-35807557    0.585456           0.0        4
 ma3:107703765-107704265  0.585026           0.0        4

 [2622 rows x 3 columns],
 '5':                          Log2FC Adjusted_pval Contrast
 ma1:86627541-86628041  1.970056           0.0        5
 ma2:94271354-94271854  1.954057           0.0        5
 mi1:9052675-9053175    1.941548           0.0        5
 mi4:15972036-15972536  1.941548           0.0        5
 mi8:2047445-2047945    1.941118           0.0        5
 ...                         ...           ...      ...
 ma6:49717359-49717859  0.584988           0.0        5
 ma6:63083984-63084484  0.584988           0.0        5
 ma6:64046189-64046689  0.584988           0.0        5
 mi1:15376840-15377340  0.584988           0.0        5
 mi5:2253119-2253619    0.584988           0.0        5

 [46195 rows x 3 columns],
 '6':                            Log2FC Adjusted_pval Contrast
 ma5:63752539-63753039    1.111123           0.0        6
 ma2:143088687-143089187  1.104998           0.0        6
 ma2:182001025-182001525  1.103168           0.0        6
 mi4:16031232-16031732    1.101949           0.0        6
 ma2:119763231-119763731  1.097899           0.0        6
 ...                           ...           ...      ...
 Z:86427065-86427565      0.585042           0.0        6
 ma2:101515606-101516106  0.585042           0.0        6
 ma1:142062650-142063150  0.585022           0.0        6
 ma2:128449346-128449846   0.58502           0.0        6
 ma6:64347065-64347565    0.585002           0.0        6

 [39226 rows x 3 columns],
 '7':                          Log2FC Adjusted_pval Contrast
 ma1:86627541-86628041  1.572187           0.0        7
 ma2:94271354-94271854  1.565921           0.0        7
 mi1:9052675-9053175    1.558166           0.0        7
 mi4:15972036-15972536  1.558166           0.0        7
 mi8:2047445-2047945    1.553827           0.0        7
 ...                         ...           ...      ...
 ma1:39354640-39355140  0.585174      0.000008        7
 ma4:1732112-1732612    0.585075       0.00002        7
 ma2:67501656-67502156  0.585058      0.000015        7
 ma2:72275760-72276260  0.585058      0.000015        7
 ma3:74508849-74509349     0.585           0.0        7

 [44975 rows x 3 columns]}

Having empty values will cause downstream issues. I was wondering if anyone had any input on why this might be happening or if there are reasonable ways to fix this without affecting the data interpretation too much? I'm not too familiar with cistopic so any help with this would be much appreciated! Thanks

rkelly712 commented 1 year ago

Hi @SidG13,

I've run into the same issue so i was wondering how you ended up fixing/resolving this issue? I've had a similar problem arising from the fact that i've quite a few cell types with small numbers of cells.

Thanks in advance.

SeppeDeWinter commented 1 year ago

Hi both

The missing values are due to these regions being filtered out, i.e. they don't pass the adjpval_thr (0.05) and log2fc_thr (log2(1.5)) thresholds. You could adapt these thresholds.

However you say it also causes downstream issues, this is not ideal. I will put a pin in this and check what happens if some values are empty. Preferably the code should check for this, but we might have missed it in certain parts.

Best,

Seppe

SidG13 commented 1 year ago

Hey sorry this is incredibly late, I actually did this solution, using the if statement. This was @SeppeDeWinter solution from another thread, I can't find it now.

import pyranges as pr
from pycistarget.utils import region_names_to_coordinates
region_sets = {}
region_sets['topics_otsu'] = {}
region_sets['topics_top_3'] = {}
region_sets['DARs'] = {}
for topic in region_bin_topics_otsu.keys():
    regions = region_bin_topics_otsu[topic].index[region_bin_topics_otsu[topic].index.str.startswith(('Scaffold'))] #only keep regions on known chromosomes
    region_sets['topics_otsu'][topic] = pr.PyRanges(region_names_to_coordinates(regions))
for topic in region_bin_topics_top3k.keys():
    regions = region_bin_topics_top3k[topic].index[region_bin_topics_top3k[topic].index.str.startswith(('Scaffold'))] #only keep regions on known chromosomes
    region_sets['topics_top_3'][topic] = pr.PyRanges(region_names_to_coordinates(regions))
for DAR in markers_dict.keys():
    regions = markers_dict[DAR].index[markers_dict[DAR].index.str.startswith(('Scaffold'))] #only keep regions on known chromosomes
    if len(regions) > 0:
        region_sets['DARs'][DAR] = pr.PyRanges(region_names_to_coordinates(regions))

aertslab / scenicplus

DAR calculation results in empty dictionary values for some keys #53