aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

Raylet spilling error when running run_scenicplus #89

Open saguilarfer opened 1 year ago

saguilarfer commented 1 year ago

Describe the bug

Hi!

I am Sergio Aguilar and I am currently trying to use the pipeline explained here: https://scenicplus.readthedocs.io/en/latest/pbmc_multiome_tutorial.html. For this (as a test, prior to using the complete dataset) I am using a 10% downsampling of an in-house dataset, keeping only the major lineages.

During most of the steps I was getting the following raylet warning:

(raylet) [2023-01-17 13:21:50,990 E 123488 123509] (raylet) file_system_monitor.cc:105: /scratch/devel/saguilar/tmp/ray_spill/session_2023-01-17_13-21-35_902745_117750 is over 95% full, available space: 160982840721408; capacity: 4878630700990464. Object creation will fail if spilling is required.

In all of them, there was no need for spilling and the code provided worked perfectly (amazing documentation, congratulations). However, I am now running the last step, run_scenicplus, with the following code:

import os
import dill

scplus_obj = dill.load(open(scplus_obj_path, 'rb'))
biomart_host = "http://sep2019.archive.ensembl.org/"

#only keep the first two columns of the PCA embedding in order to be able to visualize this in SCope
scplus_obj.dr_cell['GEX_X_pca'] = scplus_obj.dr_cell['GEX_X_pca'].iloc[:, 0:2]

from scenicplus.wrappers.run_scenicplus import run_scenicplus
try:
    run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['GEX_celltype'],
        species = 'hsapiens',
        assembly = 'hg38',
        tf_file = TF_annot_fpath + 'utoronto_human_tfs_v_1.01.txt',
        save_path = scenicplusDir + 'objects/',
        biomart_host = biomart_host,
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = True,
        export_to_UCSC_file = True,
        path_bedToBigBed = bedToBigBed_path,
        n_cpu = int(n_cpu),
        _temp_dir = os.path.join(tmpDir, 'ray_spill'))
except Exception as e:
    #in case of failure, still save the object
    dill.dump(scplus_obj, open(scenicplusDir + 'objects/scplus_obj.pkl', 'wb'), protocol=-1)
    raise

and here, during the "Calculating TF to gene correlation using GBM method" step, spilling is needed and it crashes because the object could not be created:

2023-01-17 13:21:42,033 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
2023-01-17 13:21:42,553 TF2G         INFO     Calculating TF to gene correlation, using GBM method
initializing:   0%|▏                                                                                                                                                   | 49/37378 [00:10<2:24:04,  4.32it/s](raylet) [2023-01-17 13:21:50,990 E 123488 123509] (raylet) file_system_monitor.cc:105: /scratch/devel/saguilar/tmp/ray_spill/session_2023-01-17_13-21-35_902745_117750 is over 95% full, available space: 160982840721408; capacity: 4878630700990464. Object creation will fail if spilling is required.
initializing:   0%|▎                                                                                                                                                   | 86/37378 [00:18<2:21:20,  4.40it/s](raylet) [2023-01-17 13:22:01,086 E 123488 123509] (raylet) file_system_monitor.cc:105: /scratch/devel/saguilar/tmp/ray_spill/session_2023-01-17_13-21-35_902745_117750 is over 95% full, available space: 160976900902912; capacity: 4878630700990464. Object creation will fail if spilling is required.
initializing:   0%|▌                                                                                                                                                  | 134/37378 [00:28<2:23:16,  4.33it/s](raylet) [2023-01-17 13:22:11,097 E 123488 123509] (raylet) file_system_monitor.cc:105: /scratch/devel/saguilar/tmp/ray_spill/session_2023-01-17_13-21-35_902745_117750 is over 95% full, available space: 160969224937472; capacity: 4878630700990464. Object creation will fail if spilling is required.
initializing:   0%|▋                                                                                                                                                  | 183/37378 [00:38<2:16:07,  4.55it/s](raylet) [2023-01-17 13:22:21,109 E 123488 123509] (raylet) file_system_monitor.cc:105: /scratch/devel/saguilar/tmp/ray_spill/session_2023-01-17_13-21-35_902745_117750 is over 95% full, available space: 160963448729600; capacity: 4878630700990464. Object creation will fail if spilling is required.
initializing:   1%|▊                                                                                                                                                  | 221/37378 [00:48<2:23:22,  4.32it/s](raylet) [2023-01-17 13:22:31,132 E 123488 123509] (raylet) file_system_monitor.cc:105: /scratch/devel/saguilar/tmp/ray_spill/session_2023-01-17_13-21-35_902745_117750 is over 95% full, available space: 160956078645248; capacity: 4878630700990464. Object creation will fail if spilling is required.
initializing:   1%|▉                                                                                                                                                  | 241/37378 [00:52<1:55:24,  5.36it/s](raylet) Spilled 2336 MiB, 9 objects, write throughput 832 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
initializing:   1%|▉                                                                                                                                                  | 246/37378 [00:53<2:07:26,  4.86it/s](raylet) Spilled 4673 MiB, 18 objects, write throughput 1227 MiB/s.
initializing:   1%|█                                                                                                                                                  | 261/37378 [00:56<2:02:43,  5.04it/s](raylet) Spilled 10093 MiB, 38 objects, write throughput 1495 MiB/s.
initializing:   1%|█                                                                                                                                                  | 273/37378 [00:58<1:52:46,  5.48it/s](raylet) [2023-01-17 13:22:41,144 E 123488 123509] (raylet) file_system_monitor.cc:105: /scratch/devel/saguilar/tmp/ray_spill/session_2023-01-17_13-21-35_902745_117750 is over 95% full, available space: 160926015053824; capacity: 4878630700990464. Object creation will fail if spilling is required.
initializing:   1%|█                                                                                                                                                  | 276/37378 [00:59<2:12:32,  4.67it/s]
Local disk is full
The object cannot be created because the local object store is full and the local disk's utilization is over capacity (95% by default).Tip: Use `df` on this node to check disk usage and `ray memory` to check object store memory usage.
2023-01-17 13:23:00,153 TF2G         INFO     Took 77.59992384910583 seconds
2023-01-17 13:23:00,153 TF2G         INFO     Adding correlation coefficients to adjacencies.
Traceback (most recent call last):
  File "<stdin>", line 22, in <module>
  File "<stdin>", line 2, in <module>
  File "/home/groups/biomed/saguilar/scenicplus/src/scenicplus/wrappers/run_scenicplus.py", line 151, in run_scenicplus
    calculate_TFs_to_genes_relationships(scplus_obj, 
  File "/home/groups/biomed/saguilar/scenicplus/src/scenicplus/TF_to_gene.py", line 332, in calculate_TFs_to_genes_relationships
    adj = pd.concat(tfs_to_genes).sort_values(by='importance', ascending=False)
UnboundLocalError: local variable 'tfs_to_genes' referenced before assignment

I tried with different n_cpu values (1, 5, 12, 30, 80) and the same error appears. (I can send the full output if needed; I could not include it here due to the character limit.)

Could you please help me with this raylet issue?

Thank you very much in advance,

Kind regards

cbravo93 commented 1 year ago

Hi @saguilarfer !

TF-gene inference is the most resource-intensive step :/.

  1. Ray tends to spill objects when it reaches the memory limit, so if possible you can try increasing memory.
  2. Based on the error, I think the problem is that the folder where it is spilling (tmpDir) is getting full. We had this before when we were using /tmp on our servers; now we work on /scratch (where we have more space). Could this be the issue in your case?
  3. If you have run (py)SCENIC you can skip this step:

    load_TF2G_adj_from_file(scplus_obj,
                            f_adj = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/single_sample_scenic_HQ/out/scenic/10x_multiome_brain_HQ/arboreto_with_multiprocessing/10x_multiome_brain_HQ__adj.tsv',
                            inplace = True,
                            key = 'TF2G_adj')
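As a quick sanity check for point 2, something like the following stdlib-only sketch (the spill path is hypothetical; adjust it to your system) can verify that the spill directory's filesystem has room before launching:

```python
import os
import shutil

def free_space_gb(path: str) -> float:
    """Free space (GiB) on the filesystem holding `path`.

    Walks up to the nearest existing ancestor, so it also works for
    directories that have not been created yet.
    """
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        probe = os.path.dirname(probe) or "/"
    return shutil.disk_usage(probe).free / 1024**3

# Hypothetical spill location; pick a filesystem with plenty of room.
spill_dir = "/scratch/ray_spill"
print(f"{spill_dir}: {free_space_gb(spill_dir):.0f} GiB free")
```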

Cheers!

C

saguilarfer commented 1 year ago

Hi @cbravo93

First of all, thank you very much for your quick answer :)

1- Is this step dependent on the total number of genes? I can try to increase memory, but if at some point it is dependent on the number of cells, it will surely crash when using the full dataset.

2- I should not have space problems; I am also working on /scratch.

3- I did run pySCENIC for this dataset. But thinking of other projects, I would really like to have SCENIC+ implemented without needing to run (py)SCENIC alongside it.

Any other workaround?

cbravo93 commented 1 year ago

Hi!

  1. Yes! The step depends on the number of cells and the coverage (which also affects the number of genes).
  2. Something on our to-do list: instead of using a given list of TFs, you can use only the TFs for which you actually have cistromes (in the cistromes slot). The advantage is that you will likely cut the number of TFs at least in half; the disadvantage is that if you want to reuse the adjacencies later with other parameters, you may miss TFs.
  3. When developing SCENIC, we had even more scalability problems with GENIE3 in R. An approach that worked well for very large datasets was to downsample (or take the best/least sparse) cells, preferably per cluster. The rest of the cells can then come back for regulon scoring.
  4. Can you provide the ray logs?
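The per-cluster downsampling in point 3 could be sketched like this with pandas (the column name and counts are hypothetical; the held-out cells come back afterwards for regulon scoring):

```python
import pandas as pd

def downsample_per_cluster(cell_meta: pd.DataFrame, cluster_col: str,
                           n_per_cluster: int, seed: int = 0) -> pd.Index:
    """Pick at most `n_per_cluster` cells from each cluster, reproducibly."""
    sampled = cell_meta.groupby(cluster_col, group_keys=False).apply(
        lambda g: g.sample(n=min(len(g), n_per_cluster), random_state=seed)
    )
    return sampled.index

# Hypothetical usage: cell_meta is indexed by barcode with a 'celltype' column.
# keep = downsample_per_cluster(cell_meta, 'celltype', 500)
# ...then subset the SCENIC+ object to `keep` before TF-to-gene inference.
```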

Cheers!

C

saguilarfer commented 1 year ago

Hi,

I am using a 10% downsampling of the full dataset (already cleaned), selecting the two major lineages (B and T cells), which ends up as a dataset of approximately 5K cells and 35K genes. Isn't that sufficiently small to run the pipeline with 600GB of RAM?

I was using this dataset as a test to speed up the analysis and have the pipeline ready before using all the lineages, but at some point I want to run it with the full dataset if possible. I tried with approximately 600GB of RAM; I will try increasing to 1T/2T.

As a workaround, I could try what you suggested. However, although I ran pySCENIC, I did it for each lineage separately (T and B cells), also due to memory limits. Therefore, I have two separate adjacency matrices and will have to run SCENIC+ separately for B and T cells.

As you requested, I am attaching the logs.zip of session_2023-01-17_13-21-35_902745_117750.

Thank you very much for your help

cbravo93 commented 1 year ago

Hi @saguilarfer !

600GB of RAM should be plenty for that size (based on our complexity tests). However, I would recommend filtering the genes more aggressively, not only for scalability but also because keeping lowly expressed genes can give false positives (due to all values being 0). Typically we keep 15-20K genes after filtering (by default in SCENIC+ we require that a gene is expressed in at least 0.5% of the cells, but you can set other thresholds).
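The 0.5%-of-cells filter described above could be sketched with numpy as follows (a simplified stand-in, not SCENIC+'s own implementation; the counts matrix is assumed to be cells x genes):

```python
import numpy as np

def filter_genes(counts: np.ndarray, gene_names: list,
                 min_pct_cells: float = 0.5):
    """Keep genes detected (count > 0) in at least `min_pct_cells`% of cells."""
    n_cells = counts.shape[0]
    detected_in = (counts > 0).sum(axis=0)          # cells expressing each gene
    keep = detected_in >= n_cells * (min_pct_cells / 100.0)
    kept_names = [g for g, k in zip(gene_names, keep) if k]
    return counts[:, keep], kept_names
```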

I'll take a look at the logs to see if I can spot anything weird. Another option is to set object_store_memory manually (you can pass it as **kwargs), for instance to 400GB when using 600GB of total memory.
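One gotcha with the object_store_memory option: ray.init expects the value in bytes. A sketch of the sizing (the actual call is left commented out since it needs a live Ray installation, and the forwarding of extra keyword arguments is as described above):

```python
# Cap Ray's object store at ~400 GB of a 600 GB node; ray.init takes bytes.
object_store_memory = 400 * 1024**3

# Hypothetical call: run_scenicplus passes extra keyword arguments through,
# so the cap could be supplied like this (not executed here):
# run_scenicplus(scplus_obj=scplus_obj, ...,
#                object_store_memory=object_store_memory)
print(object_store_memory)  # value in bytes
```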

Cheers!

C

saguilarfer commented 1 year ago

Hi @cbravo93

Thanks for the recommendation; I will include the gene filtering step in the pipeline. We had set the threshold to expression in at least 5 cells, so as not to lose genes expressed only in very small clusters. However, I agree that for this analysis we could be more stringent.

I will try the workaround with the pySCENIC output; please keep me posted on the raylet logs to see if we can work out why it is crashing!

Thank you very much for your quick responses and help!

Kind regards,

kfenggg commented 10 months ago

Hi @saguilarfer,

Thank you for posting this! I am running into the same issue as you and was wondering if you were able to find a workaround.

Cheers, Kevin

SeppeDeWinter commented 9 months ago

Hi @kfenggg

In the development branch I have been working on using joblib instead of Ray for parallelisation. This solves many of these issues. The code in that branch is also better optimized.

If you would like to use that branch, please see: https://github.com/aertslab/scenicplus/discussions/202

Best,

Seppe

mason-sweat1 commented 7 months ago

Hi,

Was there ever a workaround for this bug that did not involve reducing the number of cells/genes or using the development branch?

Thanks very much.

mason-sweat1 commented 7 months ago

For the record, I was able to get this to work by setting a temp directory that already had stuff inside of it.

kfenggg commented 7 months ago

Hi @mason-sweat1,

I was not able to figure out a workaround that did not involve reducing the number of cells or using the development branch. I'm glad you were able to get it working though! Could you elaborate on what you had in your temp dir?

Thanks in advance!

mason-sweat1 commented 7 months ago

We have a directory for our lab that has a ton of space. Originally I made a temp folder in the scenicplus parent directory, which didn't have anything in it but was inside the lab directory, so it theoretically had access to tons of space. Not sure why, but Ray had issues with that.

I modified the directory to be elsewhere in the lab share, upstream of my personal directory. Keep in mind that the total path length needs to be under 45 characters, otherwise you get a different error.

Not sure why that worked for me, but it's worth a try.
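Based on the path-length observation above (the ~45-character limit is an empirical finding from this thread, not a documented Ray constant), a tiny guard before launching might look like:

```python
def check_temp_dir(path: str, max_len: int = 45) -> str:
    """Fail early if the Ray temp-dir path looks too long to be safe."""
    if len(path) >= max_len:
        raise ValueError(
            f"temp dir path is {len(path)} characters; "
            f"keep it under {max_len} to be safe")
    return path
```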