pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
[BUG] Binarisation taking too long #188

Closed sykim14 closed 4 years ago

sykim14 commented 4 years ago

**Describe the bug**

Hi, I am trying to binarise the AUCell matrix for further analysis. I am following the code in SCENICprotocol. I successfully extracted the AUCell scores to .csv files. However, binarising the data is taking over 24 hours. (My single-cell data contains 3282 cells and 442 regulons.)

The code I used is:

bin_mtx, thresholds = binarize(auc_mtx)

When I stop it with Ctrl+C, it gives me:

^CTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/site-packages/pyscenic/", line 79, in binarize
    thresholds = derive_thresholds(auc_mtx)
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/site-packages/pyscenic/", line 77, in derive_thresholds
    thrs = p.starmap( derive_threshold,  [ (auc_mtx, c, seed) for c in auc_mtx.columns ] )
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/multiprocessing/", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/multiprocessing/", line 651, in get
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/multiprocessing/", line 648, in wait
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/data/syk3418/anaconda3/envs/lungcancer1/lib/python3.7/", line 296, in wait

Thank you.

**Please complete the following information:**
- pySCENIC version: 0.10.2
- Installation method: conda
- Run environment: conda env
- OS: Linux (I am running Python in a conda env through a Linux terminal)
- Package versions:

packages in environment at /data/syk3418/anaconda3/envs/lungcancer1:


Name Version Build Channel

cflerin commented 4 years ago

Hi @sykim14 ,

It's not clear from your error what's going on unfortunately. Is the process actually running and using CPU time for the entire 24 hours?

One thing to try: if you're calling that function directly, the default is to use one core. Could you try increasing it (if you have the resources available)?

bin_mtx, thresholds = binarize(auc_mtx, num_workers=20)
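
If it is unclear how many workers the machine can support, a small helper can cap the requested value at the number of cores the process is allowed to use. This is only a sketch: `available_cores` and `pick_num_workers` are hypothetical helpers, not part of pySCENIC, and only the `num_workers` argument itself comes from the call above.

```python
import os


def available_cores():
    """Number of CPU cores this process is allowed to use."""
    # sched_getaffinity respects CPU affinity (e.g. taskset) but is not
    # available on every platform, so fall back to cpu_count().
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def pick_num_workers(requested, reserve=1):
    """Cap `requested` so that `reserve` cores stay free; never return less than 1."""
    usable = max(1, available_cores() - reserve)
    return max(1, min(requested, usable))


if __name__ == "__main__":
    print("cores available:", available_cores())
    print("num_workers to pass:", pick_num_workers(20))
```

The result could then be passed on, for example as `binarize(auc_mtx, num_workers=pick_num_workers(20))`.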

sykim14 commented 4 years ago

@cflerin Hi, thank you for your reply

I tried changing num_workers, but when I watch it with 'top' in the Linux terminal, I can see the number of cores in use increase and then drop back to 1 after a few seconds, so I don't think it is running properly.

The error message after Ctrl+C doesn't change no matter how long I run the code or how much I increase num_workers.

How long does this binarisation step usually take?

Thank you

cflerin commented 4 years ago

I just tried this on a matrix of 10k cells by 375 regulons and it completed in 5 minutes using 8 workers. Can you check if your matrix looks right (cells in rows, regulons in columns), like this:

                    ARNTL2_(+)  ASCL2_(+)  ATF1_(+)  ATF3_(+)  ATF4_(+)  ...  \
AAACCCAAGCGCCCAT-1    0.022281   0.000000  0.050656  0.046860  0.030110  ...   
AAACCCACAGAGTTGG-1    0.023544   0.000000  0.030828  0.068298  0.041143  ...   
AAACCCACAGGTATGG-1    0.022685   0.138494  0.062297  0.033759  0.029247  ...   
AAACCCACATAGTCAC-1    0.042440   0.000000  0.051211  0.039889  0.030493  ...   
AAACCCACATCCAATG-1    0.044689   0.115552  0.037881  0.043455  0.018174  ...   
...                        ...        ...       ...       ...       ...  ...   
TTTGTTGGTCTGTAAC-1    0.015865   0.000000  0.051778  0.052338  0.043716  ...   
TTTGTTGGTGCGTCGT-1    0.021624   0.000000  0.044749  0.039394  0.029402  ...   
TTTGTTGGTTTGAACC-1    0.009271   0.000000  0.037608  0.038684  0.050083  ...   
TTTGTTGTCCAAGCCG-1    0.014399   0.000000  0.052829  0.042617  0.045972  ...   
TTTGTTGTCTTACTGT-1    0.015511   0.000000  0.047092  0.041113  0.044001  ...
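
A quick way to confirm that layout programmatically is sketched below. `check_auc_orientation` is a hypothetical helper, and it assumes that regulon names end in "(+)" or "(-)", as they do in the matrices shown in this thread; other naming schemes would need a different test.

```python
import pandas as pd


def check_auc_orientation(auc_mtx):
    """Return a list of problems; an empty list means the layout looks as expected."""
    problems = []
    # Every column should hold numeric AUC values.
    non_numeric = [
        c for c, dt in auc_mtx.dtypes.items()
        if not pd.api.types.is_numeric_dtype(dt)
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric[:5]}")
    # Assumption: regulon names end in "(+)" or "(-)".
    suffixes = ("(+)", "(-)")
    cols_are_regulons = all(str(c).endswith(suffixes) for c in auc_mtx.columns)
    rows_are_regulons = all(str(i).endswith(suffixes) for i in auc_mtx.index)
    if rows_are_regulons and not cols_are_regulons:
        problems.append("regulons appear to be in rows; try auc_mtx.T")
    elif not cols_are_regulons:
        problems.append("some column names do not look like regulons")
    return problems


# Tiny made-up matrix: cells in rows, regulons in columns.
demo = pd.DataFrame(
    {"AHR(+)": [0.0, 0.046174], "ARID3A(+)": [0.044381, 0.034884]},
    index=["cell_1", "cell_2"],
)
print(check_auc_orientation(demo))    # []
print(check_auc_orientation(demo.T))  # reports that regulons are in rows
```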

sykim14 commented 4 years ago

@cflerin Hi, I tried printing auc_mtx and it gives me a matrix very similar to yours. The only differences seem to be that I have 'Regulon' and 'Cell' written as part of the matrix and that I am missing the '_' in the regulon names. Would this be a problem? Also, I saved this to a .csv file and found that some zeros are written as 0.0 and others as 0, with varying numbers of decimal places.

Could either of these be the reason for my error?

>>> print(auc_mtx)
Regulon                         AHR(+)  ARID3A(+)  ...  ZNF91(+)  ZSCAN29(+)
Cell                                               ...                      
SAMPLE_AAACCTGGTGGCAAAC.1_2   0.000000   0.044381  ...    0.0000         0.0
SAMPLE_AAACGGGAGAGAGCTC.1_2   0.046174   0.034884  ...    0.0000         0.0
SAMPLE_AAACGGGAGATGCGAC.1_2   0.019101   0.021464  ...    0.0000         0.0
SAMPLE_AAAGTAGAGTGAAGAG.1_2   0.014412   0.012412  ...    0.0000         0.0
SAMPLE_AAATGCCGTCCATGAT.1_2   0.021487   0.028292  ...    0.0000         0.0
...                                ...        ...  ...       ...         ...
SAMPLE_TTTGCGCCAGACGTAG.1_11  0.021359   0.028727  ...    0.0000         0.0
SAMPLE_TTTGGTTAGCTACCGC.1_11  0.031224   0.015983  ...    0.0000         0.0
SAMPLE_TTTGGTTGTGCAGACA.1_11  0.027919   0.031384  ...    0.0535         0.0
SAMPLE_TTTGTCAAGGTCATCT.1_11  0.122827   0.032754  ...    0.0000         0.0
SAMPLE_TTTGTCACATGGTCAT.1_11  0.026570   0.014897  ...    0.0000         0.0

[3782 rows x 422 columns]

cflerin commented 4 years ago

The underscore in the regulon names won't be a problem here, and I think pandas should handle the decimals when reading the file. As long as the cell column is the data frame index and not a data column, it should be fine... you could check that auc_mtx.iloc[:,0] returns something like:

Name: AHR_(+), Length: 10280, dtype: float64

I loaded mine with

auc_mtx = pd.read_csv('auc_mtx.tsv', sep='\t', index_col=0)
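
As a small illustration of the point about decimals, the sketch below reads a made-up comma-separated file (the file in the line above is tab-separated) in which the same zero is written as `0`, `0.0` and `0.000000`. The cell names and values are invented for the example.

```python
import io

import pandas as pd

# Hypothetical file content: zeros written with different numbers of decimals.
csv_text = (
    "Cell,AHR(+),ARID3A(+)\n"
    "cell_1,0,0.044381\n"
    "cell_2,0.0,0.034884\n"
    "cell_3,0.000000,0\n"
)

auc_mtx = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(auc_mtx.dtypes)      # both columns are float64
print(auc_mtx.iloc[:, 0])  # a float Series named AHR(+), indexed by cell
```

A column whose values are all written as bare integers would be read as `int64` instead; `auc_mtx.astype(float)` would normalise that if needed.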

Did you maybe merge two matrices together? You can check for NAs from the join, which might cause this, with: pd.isna(auc_mtx).sum() (If you find any, I would recommend setting them to 0)
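
To show how a join can introduce NAs, here is a minimal sketch with two made-up matrices that share only some regulons. The data are invented; `pd.concat`, `pd.isna` and `fillna` are standard pandas calls.

```python
import pandas as pd

# Two hypothetical AUC matrices with partly different regulons.
a = pd.DataFrame(
    {"AHR(+)": [0.01, 0.02], "ARNT(+)": [0.03, 0.04]},
    index=["cell_1", "cell_2"],
)
b = pd.DataFrame({"AHR(+)": [0.05], "ASCL1(+)": [0.06]}, index=["cell_3"])

# Stacking the rows leaves NaN wherever a regulon is missing from one input.
merged = pd.concat([a, b])

na_per_regulon = pd.isna(merged).sum()
print(na_per_regulon[na_per_regulon > 0])  # only the regulons that contain NAs

# Set the NAs to 0, as recommended above.
cleaned = merged.fillna(0)
```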

sykim14 commented 4 years ago

@cflerin Thanks for your reply again :)

auc_mtx.iloc[:,0] gives me...

>>> auc_mtx.iloc[:,0]
Name: AHR(+), Length: 3782, dtype: float64

Is that similar to what you got?

pd.isna(auc_mtx).sum() gives me

>>> pd.isna(auc_mtx).sum()
AHR(+)        0
ARID3A(+)     0
ARNT(+)       0
ARNTL(+)      0
ASCL1(+)      0
ZNF84(+)      0
ZNF841(+)     0
ZNF891(+)     0
ZNF91(+)      0
ZSCAN29(+)    0
Length: 422, dtype: int64

I don't think I merged two matrices together. I think everything is set to 0?

Thank you

cflerin commented 4 years ago

Ok, I don't see anything obviously wrong. Are you reading this file from disk? Can you post the first few lines?

Could you possibly try the binarization step on a different compute environment (fresh install, different computer, etc.)?

sykim14 commented 4 years ago

Oh, I was printing auc_mtx from the object, not reading it from a saved csv file. The code I used to read the csv file is:

AUCELL_MTX_FNAME = os.path.join(DATA_FOLDER, 'cancer_mp_AUCMatrix.csv')
auc_mtx_ca = pd.read_csv(AUCELL_MTX_FNAME, index_col=0)

Although I set index_col=0, I still see 'Cell' as a row name. Does this mean 'Cell' is part of a data column rather than the index?

>>> auc_mtx_ca
                                AHR(+)  ARID3A(+)  ...  ZNF91(+)  ZSCAN29(+)
Cell                                               ...                      
SAMPLE_AAACCTGGTGGCAAAC.1_2   0.000000   0.044381  ...    0.0000         0.0
SAMPLE_AAACGGGAGAGAGCTC.1_2   0.046174   0.034884  ...    0.0000         0.0
SAMPLE_AAACGGGAGATGCGAC.1_2   0.019101   0.021464  ...    0.0000         0.0
SAMPLE_AAAGTAGAGTGAAGAG.1_2   0.014412   0.012412  ...    0.0000         0.0
SAMPLE_AAATGCCGTCCATGAT.1_2   0.021487   0.028292  ...    0.0000         0.0
...                                ...        ...  ...       ...         ...
SAMPLE_TTTGCGCCAGACGTAG.1_11  0.021359   0.028727  ...    0.0000         0.0
SAMPLE_TTTGGTTAGCTACCGC.1_11  0.031224   0.015983  ...    0.0000         0.0
SAMPLE_TTTGGTTGTGCAGACA.1_11  0.027919   0.031384  ...    0.0535         0.0
SAMPLE_TTTGTCAAGGTCATCT.1_11  0.122827   0.032754  ...    0.0000         0.0
SAMPLE_TTTGTCACATGGTCAT.1_11  0.026570   0.014897  ...    0.0000         0.0

[3782 rows x 422 columns]

I will try with a fresh installation.

Thank you

cflerin commented 4 years ago

I think the index column title is fine, and a similar matrix works for me here.
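
For reference, the sketch below shows why the 'Cell' line appears in the printout: it is the name of the index, not a data column. The file content is made up for the example.

```python
import io

import pandas as pd

# Hypothetical CSV whose first header field is "Cell".
csv_text = (
    "Cell,AHR(+),ARID3A(+)\n"
    "cell_1,0.0,0.044381\n"
    "cell_2,0.046174,0.034884\n"
)
auc_mtx = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(auc_mtx.index.name)         # Cell
print("Cell" in auc_mtx.columns)  # False
print(auc_mtx.shape)              # (2, 2)

# Optional: drop the label so the printed frame has no extra "Cell" line.
unnamed = auc_mtx.rename_axis(None)
```

The values are unchanged by `rename_axis(None)`; only the index label is removed.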

This seems to be related to #116 but I don't have any further suggestions right now, sorry.

sykim14 commented 4 years ago

@cflerin Hi, I was able to run binarize with a fresh environment!

Thank you so much for your help!