saguilarfer opened this issue 1 year ago
Hi @saguilarfer !
TF-gene inference is the most resource-intensive step :/.
```python
load_TF2G_adj_from_file(
    scplus_obj,
    f_adj = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/single_sample_scenic_HQ/out/scenic/10x_multiome_brain_HQ/arboreto_with_multiprocessing/10x_multiome_brain_HQ__adj.tsv',
    inplace = True,
    key = 'TF2G_adj'
)
```
Cheers!
C
Hi @cbravo93
First of all, thank you very much for your quick answer :)
1- Is this step dependent on the total number of genes? I can try to increase memory, but if at some point it depends on the number of cells, it will surely crash when using the full dataset.
2- I should not have space problems, I am also working in /scratch
3- I did run pySCENIC for this dataset. But thinking of other projects, I would really like to be able to run SCENIC+ without needing to run (py)SCENIC separately beforehand.
Any other workaround?
Hi!
Cheers!
C
Hi,
I am using a 10% downsampling of the full dataset (already cleaned), selecting the two major lineages (B and T cells), which ends up as a dataset of approx. 5K cells and 35K genes. Isn't that sufficiently small to run the pipeline with 600GB RAM?
I was using this dataset as a test to speed up the analysis and have the pipeline ready before using all the lineages, but at some point I wanted to run it with the full dataset if possible. I tried it with approx. 600GB of RAM; I will try increasing to 1TB/2TB.
As a workaround, I could try what you suggested. However, although I ran pySCENIC, I did it for each lineage separately (T and B cells), also due to memory limits. Therefore, I have two separate adjacency matrices, and I would have to run SCENIC+ separately for B and T cells.
As you requested, here I am attaching the logs.zip of the session_2023-01-17_13-21-35_902745_117750
Thank you very much for your help
Hi @saguilarfer !
600GB RAM should be plenty of memory for that size (based on our complexity tests). However, I would recommend filtering the genes more aggressively, not only for scalability but also because keeping lowly expressed genes can give false positives (due to all values being 0). Typically we keep 15-20K genes after filtering (by default SCENIC+ requires that a gene is expressed in at least 0.5% of the cells, but you can set other thresholds).
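As a rough illustration of that kind of filter (a hedged sketch in plain NumPy with a toy matrix; the actual SCENIC+/scanpy filtering functions may differ), keeping genes expressed in at least 0.5% of cells could look like:

```python
import numpy as np

# Toy counts matrix: 5000 cells x 100 genes (values are illustrative)
rng = np.random.default_rng(0)
counts = rng.poisson(0.02, size=(5000, 100))

# Keep genes expressed (count > 0) in at least 0.5% of cells
min_cells = int(np.ceil(0.005 * counts.shape[0]))  # 25 cells here
expressed_in = (counts > 0).sum(axis=0)            # cells expressing each gene
keep = expressed_in >= min_cells
filtered = counts[:, keep]

print(min_cells)          # 25
print(filtered.shape[1])  # number of genes passing the filter
```

The same threshold can be expressed with scanpy's `sc.pp.filter_genes(adata, min_cells=...)` if you work with an AnnData object.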
I'll take a look at the logs to see if I can spot anything weird. Another option is to set object_store_memory manually (you can pass it as **kwargs), to 400GB for instance when using 600GB of total memory.
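A hedged sketch of what that might look like (exactly how the keyword is forwarded depends on the SCENIC+ version; `object_store_memory` is ultimately a `ray.init` argument and is given in bytes):

```python
# object_store_memory is specified in bytes; 400 GB would be:
object_store_memory = 400 * 1024**3
print(object_store_memory)  # 429496729600

# Hypothetical call, assuming run_scenicplus forwards **kwargs to ray.init:
# run_scenicplus(
#     scplus_obj=scplus_obj,
#     ...,
#     object_store_memory=object_store_memory,
# )
```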
Cheers!
C
Hi @cbravo93
Thanks for the recommendation, I will include the gene filtering step in the pipeline. We had set the threshold to "expressed in at least 5 cells" so as not to lose genes expressed only in very small clusters. However, I agree that for this analysis we could be more stringent.
I will try the workaround with the pySCENIC output. Please keep me posted about the raylet logs, so we can figure out why it is crashing!
Thank you very much for your quick responses and help!
Kind regards,
Hi @saguilarfer,
Thank you for posting this! I am running into the same issue as you and was wondering if you were able to find a workaround.
Cheers, Kevin
Hi @kfenggg
In the development branch I have been working on using joblib instead of ray for parallelisation, which solves many of these issues. The code in that branch is also better optimized.
If you would like to use that branch, please see: https://github.com/aertslab/scenicplus/discussions/202
Best,
Seppe
Hi,
Was there ever a workaround for this bug that did not involve reducing the number of cells/genes or using the development branch?
Thanks very much.
For the record, I was able to get this to work by setting a temp directory that already had stuff inside of it.
Hi @mason-sweat1,
I was not able to figure out a workaround that did not involve reducing the number of cells or using the development branch. I'm glad you were able to get it working though! Could you elaborate on what you had in your temp dir?
Thanks in advance!
We have a directory for our lab that has a ton of space. Originally I made a temp folder in the scenicplus parent directory, which didn't have anything in it but was inside the lab directory, so in theory it should have had access to plenty of space. Not sure why, but ray had issues with that.
I moved the temp directory to another place in the lab share, upstream of my personal directory. Keep in mind that the total path length needs to be under 45 characters, otherwise you get a different error.
Not sure why that worked for me, but it's worth a try.
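A hedged sketch of that kind of guard (the 45-character limit is plausibly a Unix-socket path limit that Ray's session directory runs into; the directory name below is hypothetical, and how your SCENIC+ version lets you relocate Ray's temp dir may vary):

```python
# Hypothetical short temp directory on the lab share
temp_dir = "/lab_share/ray_tmp"

# Ray session paths (e.g. plasma store sockets) can break on long
# paths; keep the total length under 45 characters as noted above.
assert len(temp_dir) < 45, f"temp dir path too long ({len(temp_dir)} chars)"
print(len(temp_dir))  # 18

# Hypothetical usage, assuming the option reaches ray.init:
# ray.init(_temp_dir=temp_dir, ...)
```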
Describe the bug
Hi!
I am Sergio Aguilar and I am currently trying to use the pipeline explained here: https://scenicplus.readthedocs.io/en/latest/pbmc_multiome_tutorial.html. For this (as a test, before using the complete dataset) I am using a 10% downsampling of an in-house dataset, keeping only the major lineages.
During most of the steps I was getting a raylet warning. In all of them there was no need for spilling, and the code provided worked perfectly (amazing documentation, congratulations). However, I am now running the last step with run_scenicplus, with the following code:
Here, once it reaches "Calculating TF to gene correlation using GBM method", spilling is needed and it crashes because the object could not be created.
I tried with different n_cpu values (1, 5, 12, 30, 80) and the same error appears. (I can send the full output if needed; I could not include it here due to the character limit.) Could you please help me with the raylet issue?
Thank you very much in advance,
Kind regards