aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
163 stars 27 forks source link

scenicplus object file request due to run_scenicplus function being extremely slow #309

Closed elhaam closed 3 months ago

elhaam commented 4 months ago

Hello, and thank you for this helpful framework!

What type of problem are you experiencing and which function is you problem related too I am using run_scenicplus according to pbmc_multiome_tutorial.ipynb example and it has been taking half a day and only about 3% done. This is the command I am running now:


try:
    run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['GEX_celltype'],
        species = 'hsapiens',
        assembly = 'hg38',
        tf_file = 'pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt',
        save_path = os.path.join(work_dir, 'scenicplus'),
        biomart_host = biomart_host,
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = True,
        export_to_UCSC_file = True,
        path_bedToBigBed = 'pbmc_tutorial',
        n_cpu = 8,
        _temp_dir = os.path.join(tmp_dir, 'ray_spill'))
except Exception as e:
    #in case of failure, still save the object
    dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
    raise(e)

Describe alternatives you've considered Since my Jupyter killed the process due to memory issue, now I am running it as a script and it has been running on my HPC since 10 hours ago and only 3% done after GSEA. Below is my output so far:

Click to see output

2024-02-27 08:50:27,481 SCENIC+_wrapper INFO pbmc_tutorial/scenicplus folder already exists. 2024-02-27 08:50:27,481 SCENIC+_wrapper INFO Merging cistromes 2024-02-27 08:51:15,445 SCENIC+_wrapper INFO Getting search space 2024-02-27 08:51:16,644 R2G INFO Downloading gene annotation from biomart dataset: hsapiens_gene_ensembl 2024-02-27 08:51:31,430 R2G INFO Downloading chromosome sizes from: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes 2024-02-27 08:51:34,784 R2G INFO Extending promoter annotation to 10 bp upstream and 10 downstream Warning! Start and End columns now have different dtypes: int32 and int64 Warning! Start and End columns now have different dtypes: int32 and int64 2024-02-27 08:51:37,076 R2G INFO Extending search space to: 150000 bp downstream of the end of the gene. 150000 bp upstream of the start of the gene. Warning! Start and End columns now have different dtypes: int32 and int64 Warning! Start and End columns now have different dtypes: int32 and int64 2024-02-27 08:51:52,238 R2G INFO Intersecting with regions. Warning! Start and End columns now have different dtypes: int32 and int64 2024-02-27 08:51:52,981 R2G INFO Calculating distances from region to gene 2024-02-27 08:53:15,086 R2G INFO Imploding multiple entries per region and gene 2024-02-27 08:56:10,542 R2G INFO Done! 2024-02-27 08:56:10,879 SCENIC+_wrapper INFO Inferring region to gene relationships 2024-02-27 08:56:11,056 R2G INFO Calculating region to gene importances, using GBM method 2024-02-27 09:12:16,207 R2G INFO Took 965.1509130001068 seconds 2024-02-27 09:12:16,207 R2G INFO Calculating region to gene correlation, using SR method 2024-02-27 09:22:35,653 R2G INFO Took 619.4455173015594 seconds 2024-02-27 09:22:44,871 R2G INFO Done! 2024-02-27 09:22:45,025 SCENIC+_wrapper INFO Inferring TF to gene relationships 2024-02-27 09:22:49,550 TF2G INFO Calculating TF to gene correlation, using GBM method 2024-02-27 10:11:11,466 TF2G INFO Took 2901.9156527519226 seconds 2024-02-27 10:11:11,481 TF2G INFO Adding correlation coefficients to adjacencies. 2024-02-27 10:11:40,113 TF2G INFO Warning: adding TFs as their own target to adjecencies matrix. Importance values will be max + 1e-05 2024-02-27 10:11:49,215 TF2G INFO Adding importance x rho scores to adjacencies. 2024-02-27 10:11:49,238 TF2G INFO Took 37.75783395767212 seconds 2024-02-27 10:11:49,402 SCENIC+_wrapper INFO Build eGRN 2024-02-27 10:11:49,402 GSEA INFO Thresholding region to gene relationships 2024-02-27 10:26:32,442 GSEA INFO Subsetting TF2G adjacencies for TF with motif. 2024-02-27 10:26:38,058 GSEA INFO Running GSEA... 2024-02-27 10:26:37,511 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 initializing: 0%|▏ | 72/29546 [00:19<2:20:36, 3.49it/s](_ray_run_gsea_for_e_module pid=2037561) /home/ejafari/.local/lib/python3.8/site-packages/gseapy/algorithm.py:71: RuntimeWarning:divide by zero encountered in divide (_ray_run_gsea_for_e_module pid=2037561) norm_tag = 1.0/sum_correl_tag (_ray_run_gsea_for_e_module pid=2037561) /home/ejafari/.local/lib/python3.8/site-packages/gseapy/algorithm.py:74: RuntimeWarning:invalid value encountered in multiply (_ray_run_gsea_for_e_module pid=2037561) RES = np.cumsum(tag_indicator * correl_vector * norm_tag - no_tag_indicator * norm_no_tag, axis=axis) initializing: 86%|██████████████████████████████████████████████████████████████▉ | 25466/29546 [21:09<31:44, 2.14it/s](raylet) Spilled 3736 MiB, 10000 objects, write throughput 1881 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message. initializing: 86%|██████████████████████████████████████████████████████████████▉ | 25466/29546 [21:24<31:44, 2.14it/s](raylet) Spilled 4484 MiB, 12000 objects, write throughput 1419 MiB/s. initializing: 100%|█████████████████████████████████████████████████████████████████████████| 29546/29546 [46:40<00:00, 10.55it/s] Running using 8 cores: 3%|█▉ | 1201/36666 [10:14:42<785:49:52, 79.77s/it]

Feature request I was wondering if you could kindly upload the final result of this step as scplusobj.pkl so that we can use your results for downstream analysis for now? I am running the code, but I am unsure when it will finish even though I am running it on HPC. If I can get the results of this step and read the scenicplus object for downstream analysis that would be great. I saw a related issue 248 before, but the link mentioned in the answer is expired and says "File not found. The document could not be found on the server. Maybe the share was deleted or has expired?"._ I know loom files also exist on Scope, is there any way to convert them to scplus object format?

Secondly, as an alternative solution, could you elaborate on issue 202? For me, it says https://github.com/aertslab/scenicplus/blob/development/Snakemake/config/config.yaml is not available. Could you please help with this?

Version information

Thank you so much in advance, Seppe!

SeppeDeWinter commented 3 months ago

Hi @elhaam

For your question relating to the development branch, there is now a command to initialize the snakemake folder. I'm currently working on a tutorial for the development branch, this will come online soon.


scenicplus init_snakemake

Can you send me an e-mail at seppe.dewinter@kuleuven.be, I will send you a new link to the PBMC scenic_plus object.

All the best,

Seppe

elhaam commented 3 months ago

Thank you, Seppe!