scenicplus object file request due to run_scenicplus function being extremely slow

elhaam commented 9 months ago

Hello, and thank you for this helpful framework!

What type of problem are you experiencing and which function is you problem related too I am using run_scenicplus according to pbmc_multiome_tutorial.ipynb example and it has been taking half a day and only about 3% done. This is the command I am running now:


try:
    run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['GEX_celltype'],
        species = 'hsapiens',
        assembly = 'hg38',
        tf_file = 'pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt',
        save_path = os.path.join(work_dir, 'scenicplus'),
        biomart_host = biomart_host,
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = True,
        export_to_UCSC_file = True,
        path_bedToBigBed = 'pbmc_tutorial',
        n_cpu = 8,
        _temp_dir = os.path.join(tmp_dir, 'ray_spill'))
except Exception as e:
    #in case of failure, still save the object
    dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
    raise(e)

Describe alternatives you've considered Since my Jupyter killed the process due to memory issue, now I am running it as a script and it has been running on my HPC since 10 hours ago and only 3% done after GSEA. Below is my output so far:

Click to see output

2024-02-27 08:50:27,481 2024-02-27 08:50:27,481 2024-02-27 08:51:15,445 2024-02-27 08:51:16,644 R2G 2024-02-27 08:51:31,430 R2G 2024-02-27 08:51:34,784 R2G Warning! Start and Warning! Start and 2024-02-27 08:51:37,076 R2G Warning! Start and Warning! Start and 2024-02-27 08:51:52,238 R2G Warning! Start and 2024-02-27 08:51:52,981 R2G 2024-02-27 08:53:15,086 R2G 2024-02-27 08:56:10,542 R2G 2024-02-27 08:56:10,879 2024-02-27 08:56:11,056 R2G 2024-02-27 09:12:16,207 R2G 2024-02-27 09:12:16,207 R2G 2024-02-27 09:22:35,653 R2G 2024-02-27 09:22:44,871 R2G 2024-02-27 09:22:45,025 2024-02-27 09:22:49,550 TF2G 2024-02-27 10:11:11,466 TF2G 2024-02-27 10:11:11,481 TF2G 2024-02-27 10:11:40,113 TF2G 2024-02-27 10:11:49,215 TF2G 2024-02-27 10:11:49,238 TF2G 2024-02-27 10:11:49,402 2024-02-27 10:11:49,402 GSEA 2024-02-27 10:26:32,442 GSEA 2024-02-27 10:26:38,058 GSEA 2024-02-27 10:26:37,511 initializing: 0%|▏ (_ray_run_gsea_for_e_module pid=2037561) (_ray_run_gsea_for_e_module (_ray_run_gsea_for_e_module pid=2037561) initializing: 86%|███ initializing: 86%|███ initializing: 100%|███ Running using 8 cores: 3%|█▉ SCENIC+_wrapper INFO pbmc_tutorial/scenicplus folder already exists. SCENIC+_wrapper INFO Merging cistromes SCENIC+_wrapper INFO Getting search space INFO Downloading gene annotation from biomart dataset: hsapiens_gene_ensembl INFO Downloading chromosome sizes from: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes INFO Extending promoter annotation to 10 bp upstream and 10 downstream End columns now have different dtypes: int32 and int64 End columns now have different dtypes: int32 and int64 INFO Extending search space to: 150000 bp downstream of the end of the gene. 150000 bp upstream of the start of the gene. End columns now have different dtypes: int32 and int64 End columns now have different dtypes: int32 and int64 INFO Intersecting with regions. End columns now have different dtypes: int32 and int64 INFO Calculating distances from region to gene INFO Imploding multiple entries per region and gene INFO Done! SCENIC+_wrapper INFO Inferring region to gene relationships INFO Calculating region to gene importances, using GBM method INFO Took 965.1509130001068 seconds INFO Calculating region to gene correlation, using SR method INFO Took 619.4455173015594 seconds INFO Done! SCENIC+_wrapper INFO Inferring TF to gene relationships INFO Calculating TF to gene correlation, using GBM method INFO Took 2901.9156527519226 seconds INFO Adding correlation coefficients to adjacencies. INFO Warning: adding TFs as their own target to adjecencies matrix. Importance values will be max + 1e-05 INFO Adding importance x rho scores to adjacencies. INFO Took 37.75783395767212 seconds SCENIC+_wrapper INFO Build eGRN INFO Thresholding region to gene relationships INFO Subsetting TF2G adjacencies for TF with motif. INFO Running GSEA... INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 | 72/29546 [00:19<2:20:36, 3.49it/s](_ray_run_gsea_for_e_module pid=2037561) /home/ejafari/.local/lib/python3.8/site-packages/gseapy/algorithm.py:71: RuntimeWarning:divide by zero encountered in divide norm_tag = 1.0/sum_correl_tag pid=2037561) /home/ejafari/.local/lib/python3.8/site-packages/gseapy/algorithm.py:74: RuntimeWarning:invalid value encountered in multiply RES = np.cumsum(tag_indicator * correl_vector * norm_tag - no_tag_indicator * norm_no_tag, axis=axis) ███████████████████████████████████████████████████████████▉ | 25466/29546 [21:09<31:44, 2.14it/s](raylet) Spilled 3736 MiB, 10000 objects, write throughput 1881 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message. ███████████████████████████████████████████████████████████▉ | 25466/29546 [21:24<31:44, 2.14it/s](raylet) Spilled 4484 MiB, 12000 objects, write throughput 1419 MiB/s. ██████████████████████████████████████████████████████████████████████| 29546/29546 [46:40<00:00, 10.55it/s] | 1201/36666 [10:14:42<785:49:52, 79.77s/it]

Feature request I was wondering if you could kindly upload the final result of this step as scplusobj.pkl so that we can use your results for downstream analysis for now? I am running the code, but I am unsure when it will finish even though I am running it on HPC. If I can get the results of this step and read the scenicplus object for downstream analysis that would be great. I saw a related issue 248 before, but the link mentioned in the answer is expired and says "File not found. The document could not be found on the server. Maybe the share was deleted or has expired?"._ I know loom files also exist on Scope, is there any way to convert them to scplus object format?

Secondly, as an alternative solution, could you elaborate on issue 202? For me, it says https://github.com/aertslab/scenicplus/blob/development/Snakemake/config/config.yaml is not available. Could you please help with this?

Version information

My scatac_fragment_tools version is 0.1.0 (also tried tag 5a3f538) and Python version is 3.8.18
scatac_fragment_tools: 0.1.0
pycisTopic: 1.0a0
scenicplus 1.0.1.dev6+ge5ba6fc

Thank you so much in advance, Seppe!

SeppeDeWinter commented 8 months ago

Hi @elhaam

For your question relating to the development branch, there is now a command to initialize the snakemake folder. I'm currently working on a tutorial for the development branch, this will come online soon.


scenicplus init_snakemake

Can you send me an e-mail at seppe.dewinter@kuleuven.be, I will send you a new link to the PBMC scenic_plus object.

All the best,

Seppe

elhaam commented 8 months ago

Thank you, Seppe!

aertslab / scenicplus

scenicplus object file request due to run_scenicplus function being extremely slow #309