aertslab / scenicplus

SCENIC+ is a Python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

How to speed up the "calculate_TFs_to_genes_relationships" function wrapped in "run_scenicplus"? #52

Closed EthanZhaoChem closed 1 year ago

EthanZhaoChem commented 1 year ago

Dear authors,

Thank you for providing such a great tool for regulatory analysis. I am trying to use it on my non-multiome dataset (separate scRNA and scATAC datasets).

Task: I am running the functions wrapped in run_scenicplus step by step. I am calling calculate_TFs_to_genes_relationships with the following command:

calculate_TFs_to_genes_relationships(scplus_obj,
                                     tf_file = tf_file,
                                     ray_n_cpu = 20,
                                     method = 'GBM',
                                     _temp_dir = _temp_dir,
                                     key = 'TF2G_adj',
                                     object_store_memory = 400E9)

Problems: The job is very slow and consumes a lot of memory. I am running it on a node with 750 GB of memory and 20 CPU cores, and the progress bar estimates hundreds of hours to completion. Here are the last few lines of the job output; I did not let the job finish because it also kept getting slower and slower.

initializing:   2%|▏         | 564/29773 [1:37:21<84:50:12, 10.46s/it]
initializing:   2%|▏         | 565/29773 [1:37:55<141:31:38, 17.44s/it]
initializing:   2%|▏         | 566/29773 [1:38:08<130:56:20, 16.14s/it]
initializing:   2%|▏         | 567/29773 [1:39:48<335:17:22, 41.33s/it]
initializing:   2%|▏         | 568/29773 [1:39:57<255:00:18, 31.43s/it]
initializing:   2%|▏         | 569/29773 [1:39:57<179:15:03, 22.10s/it]

Dataset description: The dataset I am using has 5590 metacells (generated from 5590*5 cells), 29773 genes, and 249968 regions. I think this is a reasonable size compared with the pbmc dataset used in the first tutorial.

scplus_obj.X_ACC.shape = (249968, 5590)
scplus_obj.X_EXP.shape = (5590, 29773)

The TF list I am using is utoronto_human_tfs_v_1.01.txt from the first tutorial.

The methods I tried: Changing the regression method to RF performs similarly. I also tried subsetting the TF list into shorter chunks and parallelizing manually (sketched below), but the improvement is marginal.
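For clarity, this is roughly what I mean by manual parallelization (a minimal sketch; the chunk count, file names, and per-chunk keys are placeholders I made up for illustration):

# minimal sketch of the manual parallelization I tried;
# chunk count and file names are arbitrary placeholders
n_chunks = 4

# split the TF list into n_chunks smaller files
with open(tf_file) as f:
    tfs = [line.strip() for line in f if line.strip()]

chunk_size = (len(tfs) + n_chunks - 1) // n_chunks  # ceiling division
for i in range(n_chunks):
    with open(f'tf_chunk_{i}.txt', 'w') as out:
        out.write('\n'.join(tfs[i * chunk_size:(i + 1) * chunk_size]))

# each tf_chunk_<i>.txt is then passed as tf_file to its own
# calculate_TFs_to_genes_relationships run, e.g. with key='TF2G_adj_<i>'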

Questions: The number of metacells is below 6k, so why is this significantly slower than the pbmc dataset in the tutorial? Is there a way to improve the speed and lower the memory usage? Since the job keeps slowing down, I doubt it would finish even within one week.

Version information: SCENIC+ '0.1.dev445+g615c88b'

Thanks, Ethan

cbravo93 commented 1 year ago

Hi @EthanZhaoChem !

These times are really not normal; a data set of your size should take about 30 min.

Can you confirm that it is running in parallel (are all 20 cores doing something, or only one)? Did you get any additional warning messages from ray?
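A quick way to check from Python (generic ray calls, nothing SCENIC+-specific):

import ray

# total resources the cluster sees vs. what is currently free;
# 'CPU' in available_resources() should drop towards 0 while tasks run
print(ray.cluster_resources())
print(ray.available_resources())

Or just watch the cores in htop / the ray dashboard.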

PS: Did you apply any filter on the gene expression? You have a lot of genes.

Cheers!

C

EthanZhaoChem commented 1 year ago

Hi @cbravo93,

Thanks for your prompt help. I checked the ray dashboard while running the jobs and found that most ray jobs were not consuming any CPU resources or object store memory during the initialization stage. So I went to check the source code and realized that the initialization is actually not a parallel process.

Source code check

for gene in tqdm(gene_names, total=len(gene_names), desc='initializing'):
    jobs.append(_run_infer_partial_network_ray.remote(gene,
                                                      gene_names,
                                                      ex_matrix,
                                                      method_params,
                                                      tf_matrix,
                                                      tf_matrix_gene_names))

I guess it loops over all the genes and assigns each one to a ray job, one by one. Only after initialization are the jobs distributed across multiple CPU cores to run in parallel.
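If I read the ray documentation correctly, one possible improvement would be to put the large matrices into the object store once and pass object references, so each .remote() call ships a small reference instead of re-serializing the full arrays. An untested sketch based on ray's pass-objects-by-reference pattern (this is my modification, not the current SCENIC+ code):

import ray

# untested sketch: store the big read-only inputs once;
# ray dereferences the ObjectRefs inside the remote task automatically
ex_matrix_ref = ray.put(ex_matrix)
tf_matrix_ref = ray.put(tf_matrix)

jobs = []
for gene in tqdm(gene_names, total=len(gene_names), desc='initializing'):
    jobs.append(_run_infer_partial_network_ray.remote(gene,
                                                      gene_names,
                                                      ex_matrix_ref,
                                                      method_params,
                                                      tf_matrix_ref,
                                                      tf_matrix_gene_names))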

Ray dashboard: For this check I reran the job with 180 GB of memory and 8 CPU cores; here is a screenshot of the ray dashboard during the initialization process, before it finally crashed.

[Screenshots of the ray dashboard, taken 2022-10-18 at 9:37 and 9:38 AM]

Finally it crashed: After about half an hour the jobs failed; here are the output messages, with the full log attached.

| 1205/29773 [30:51<21:12:00,  2.67s/it]2022-10-18 09:47:28,389 
WARNING worker.py:1829 -- The node with node id: e990f42b2d5c15b82ecc5c119dbcbbe2c00b9b79b7071aee97949082 and address: 10.50.250.47 and node name: 10.50.250.47 has been marked dead because the detector has missed too many heartbeats from it. 
This can happen when a  
(1) raylet crashes unexpectedly (OOM, preempted node, etc.) 
(2) raylet has lagging heartbeats due to slow network or busy workload.

[2022-10-18 09:47:28,482 E 2401128 2401817] core_worker.cc:492: :info_message:
 Attempting to recover 1604 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
initializing:   4%|████                       | 1206/29773 [32:42<276:03:56, 34.79s/it](raylet) 
[2022-10-18 09:48:48,793 C 2401751 2401751] (raylet) node_manager.cc:988: 
[Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.

Here is the full log for running the function calculate_TFs_to_genes_relationships: log.txt.

Methods that I can think of: I tried increasing ray's object_store_memory to 350 GB on a node with 750 GB, but the job is still slow and keeps getting slower while running, so I expect it would also crash after a few hours.

I was also planning to customize num_heartbeats_timeout and heartbeat_timeout_milliseconds in ray, but this requires some changes to the source code, and I haven't managed to do it yet.

Thoughts: I am wondering whether running the initialization in parallel would increase the speed, and how I should go about it.

The heartbeat-timeout failure also makes me curious how to customize this parameter through calculate_TFs_to_genes_relationships. I tried passing the heartbeat-related parameters directly to this function, but ray did not recognize them.
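For reference, the closest mechanism I found in ray itself is the internal _system_config argument of ray.init. An untested sketch: _system_config is an internal, unstable API, the option name below comes from ray 1.x sources and varies between versions, and since SCENIC+ initialises ray internally, this would need to be exposed or patched in.

import ray

# untested sketch; option names are ray-version dependent
ray.init(
    num_cpus=20,
    _temp_dir=_temp_dir,
    _system_config={
        'num_heartbeats_timeout': 300,  # allow more missed heartbeats before a node is marked dead
    },
)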

My final goal is simply to run this function faster on my dataset; the most memory I can use is 750 GB. Please let me know if there is any other information that would be useful to you.

Thanks, Ethan

EthanZhaoChem commented 1 year ago

Hi @cbravo93,

I still can't run this function successfully on my dataset; it always fails during the initialization step, whose for-loop is very memory and time intensive.

Do you have any suggestions? Looking forward to your reply.

SeppeDeWinter commented 1 year ago

Hi @EthanZhaoChem

You are right, this is indeed the most memory- and CPU-intensive step of the workflow. Unfortunately the initialisation step, as it is written now, cannot be parallelised. We hope to improve the efficiency of this step in the future, though.

That being said, I think the errors you're seeing are because you are running out of memory (see Spilled 5324 MiB, 12 objects, write throughput 1262 MiB/s in your log). From the moment you run out of memory, ray starts writing objects to disk (in the _temp_dir directory), which slows the process down a lot.

Were you able to give it a try on a system with 750 GB? That should be plenty of memory (all our benchmarks were also done on systems with similar amounts of memory).

Also, I would not set the object_store_memory parameter. By default ray will use all available memory; by setting this parameter you're only limiting the amount that can be used.
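In other words, the call from your first post with object_store_memory simply dropped:

calculate_TFs_to_genes_relationships(scplus_obj,
                                     tf_file = tf_file,
                                     ray_n_cpu = 20,
                                     method = 'GBM',
                                     _temp_dir = _temp_dir,
                                     key = 'TF2G_adj')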

Hope this helps.

Best,

Seppe

EthanZhaoChem commented 1 year ago

Hi @SeppeDeWinter,

Thank you for making this clear. I will try running a smaller sample with the maximum amount of memory.

Best,

Ethan

yupingz commented 1 year ago

I have the same question. This is the size of my data: SCENIC+ object with n_cells x n_genes = 2165 x 18113 and n_cells x n_regions = 2165 x 258529.

It has been running for more than 48 hours with 20 cores. Any suggestions on how to speed up run_scenicplus?

dburkhardt commented 1 year ago

@yupingz and @EthanZhaoChem, I created a new issue because it seems this is still not fixed: https://github.com/aertslab/scenicplus/issues/109