Stuck at GSEA Step - Githubissues

li-xuyang28 commented 1 year ago

I am following the 10X Genomics PBMC tutorial and running the wrapper function. Everything was fine until the GSEA step, which has now been stuck for over 40 hours:

2023-04-26 17:32:13,593 GSEA         INFO     Subsetting TF2G adjacencies for TF with motif.
2023-04-26 17:32:19,727 INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
2023-04-26 17:32:20,376 GSEA         INFO     Running GSEA...
initializing:  23%|██▎       | 7094/31183 [23:37<05:21, 74.95it/s]

The node log does show an error message saying the node may be overloaded or terminated, or the network may be slow. But the memory usage shown for the cluster is well below 10%.

2023-04-27 13:24:41,619 ERROR node_head.py:302 -- Cannot reach the node, c96dd5c6ab1a61bc93d4ee80eff792af1b8762ed22e5afa5eb6cbef5, after timeout 4. This node may have been overloaded, terminated, or the network is slow.
NoneType: None
2023-04-27 13:24:48,627 ERROR node_head.py:302 -- Cannot reach the node, c96dd5c6ab1a61bc93d4ee80eff792af1b8762ed22e5afa5eb6cbef5, after timeout 4. This node may have been overloaded, terminated, or the network is slow.
NoneType: None
2023-04-27 13:24:51,920 INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:51 +0000] 'GET /nodes?view=summary HTTP/1.1' 200 9532 bytes 6260 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:24:51,923 INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:51 +0000] 'GET /nodes/c96dd5c6ab1a61bc93d4ee80eff792af1b8762ed22e5afa5eb6cbef5 HTTP/1.1' 200 9871 bytes 1948 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:24:54,614 INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:54 +0000] 'GET /log_index HTTP/1.1' 200 391 bytes 43230 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:24:55,633 ERROR node_head.py:302 -- Cannot reach the node, c96dd5c6ab1a61bc93d4ee80eff792af1b8762ed22e5afa5eb6cbef5, after timeout 4. This node may have been overloaded, terminated, or the network is slow.
NoneType: None
2023-04-27 13:24:56,226 INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:56 +0000] 'GET /log_proxy?url=http%3A%2F%2F127.0.0.1%3A52365%2Flogs HTTP/1.1' 200 3130 bytes 103802 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:24:58,168 INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:24:58 +0000] 'GET /log_proxy?url=http%3A%2F%2F127.0.0.1%3A52365%2Flogs%2Fdashboard.err HTTP/1.1' 200 660 bytes 8014 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'
2023-04-27 13:25:01,210 INFO web_log.py:206 -- 127.0.0.1 [27/Apr/2023:17:25:01 +0000] 'GET /log_proxy?url=http%3A%2F%2F127.0.0.1%3A52365%2Flogs HTTP/1.1' 200 3130 bytes 5785 us 'http://127.0.0.1:8265/' 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'

According to the Ray dashboard there still seems to be activity in the cluster, but progress has been stuck at 7094/31183 and I can't figure out why it is taking 40+ hours.
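For scale, the progress bar's own estimate can be reproduced from the numbers it printed. The sketch below assumes a constant rate and the usual MM:SS formatting; the helper name `tqdm_eta` is made up for illustration.

```python
def tqdm_eta(done: int, total: int, rate_per_s: float) -> str:
    """Format the remaining time as MM:SS, assuming the rate stays constant."""
    remaining_s = int((total - done) / rate_per_s)
    minutes, seconds = divmod(remaining_s, 60)
    return f"{minutes:02d}:{seconds:02d}"

# Numbers taken from the progress bar in the log above.
eta = tqdm_eta(done=7094, total=31183, rate_per_s=74.95)
print(eta)  # 05:21
```

So at the reported rate the remaining items needed roughly five more minutes, which suggests the 40+ hours are a stall rather than slow but steady progress.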

SeppeDeWinter commented 1 year ago

Hi @li-xuyang28

hmm... 40hrs is really very long..

How many cores were you using?

Best,

Seppe

li-xuyang28 commented 1 year ago

Hi,

I'm running on 8 cores (fewer than the 12 suggested by the tutorial), locally on an iMac, because I was having so much trouble getting Ray to work on the cluster I have access to. Somehow the entire thing was just extremely slow for me. (I restarted the process, but it is still stuck at the same step.)

This was the 10X PBMC multiome data, but I did change the cell type annotation a bit (I split the T cells into a few more subtypes).

Best, Yang

li-xuyang28 commented 1 year ago

Hi again @SeppeDeWinter ,

I tried subsetting the object and running the build_grn function several times with the 10X PBMC data, but every run got stuck during initializing (at around 16166/18918). It takes about 4 minutes to get through the items that were processed (consistent with the tutorial), and then it stays stuck indefinitely (>24h). According to the Ray dashboard there was still some activity, but the nodes seemed to be idle. Is there any information I could provide to help figure out what happened?

Best, Yang

ghuls commented 1 year ago

There is a chance that one of the worker processes crashed and that Ray didn't detect it and assumes it is still running. Try with fewer cores or on a better machine.
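One generic way to make this kind of silent stall visible is to put a hard timeout on the wait and list whatever has not come back. The sketch below uses only the Python standard library as a stand-in; it is not Ray's API, and `find_stalled` and the task names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def find_stalled(tasks, timeout_s, max_workers=2):
    """Run (name, callable) tasks; return the names not finished within timeout_s."""
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(fn): name for name, fn in tasks}
    done, not_done = wait(futures, timeout=timeout_s)
    # Do not block on the stalled workers here; just stop accepting new work.
    pool.shutdown(wait=False)
    return sorted(futures[f] for f in not_done)

tasks = [
    ("fast_module", lambda: time.sleep(0.01)),
    ("slow_module", lambda: time.sleep(0.5)),  # stands in for a hung worker
]
stalled = find_stalled(tasks, timeout_s=0.2)
print(stalled)
```

Tasks still waiting in the queue also count as unfinished, so `max_workers` should be at least the number of tasks when the goal is to single out hung workers.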

CYorick commented 1 year ago

Hi, has this problem been solved? I ran into the same issue.

SeppeDeWinter commented 1 year ago

Hi @CYorick

It might be memory related. The code in the development branch is more memory friendly.

See https://github.com/aertslab/scenicplus/discussions/202 on how to use it.

All the best,

Seppe

CYorick commented 1 year ago

Thanks for your reply. Should I simply download the Snakemake directory without changing anything else and run the whole pipeline automatically? What if I just want to run the build_grn function?

Best, Yorick

CYorick commented 1 year ago

The problem can be solved by setting "ray_n_cpu" to None.
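A plausible reading, which this thread does not confirm, is that with "ray_n_cpu" set to None the GSEA tasks run one after another in the main process instead of being handed to Ray workers. The sketch below only illustrates that general pattern: `score_module` and `run_all` are hypothetical names, and a standard-library thread pool stands in for Ray.

```python
from concurrent.futures import ThreadPoolExecutor

def score_module(module):
    """Hypothetical stand-in for one GSEA task: returns (name, number of genes)."""
    name, genes = module
    return name, len(genes)

def run_all(modules, n_cpu=None):
    """Run sequentially when n_cpu is None, otherwise fan out to a worker pool."""
    if n_cpu is None:
        # No pool and no scheduler: a failure surfaces as an ordinary exception.
        return [score_module(m) for m in modules]
    with ThreadPoolExecutor(max_workers=n_cpu) as pool:
        return list(pool.map(score_module, modules))

modules = [("TF_A", ["g1", "g2"]), ("TF_B", ["g3"])]
serial = run_all(modules, n_cpu=None)
parallel = run_all(modules, n_cpu=2)
print(serial == parallel)  # True
```

The sequential path is slower, but it removes the worker bookkeeping that can leave a run waiting on a task that will never return.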

rogercasalsfr commented 11 months ago

What do you mean set "ray_n_cpu" as None? Using a single core?

I've tried to fix it as suggested: clean the temporary directory and re-run the code. But it has not worked, even though I have 600 GB of free space. Here are the errors I get:


```
(_ray_run_gsea_for_e_module pid=959428) /home/roger/anaconda3/envs/scenicplus/lib/python3.8/site-packages/gseapy/algorithm.py:87: RuntimeWarning: divide by zero encountered in divide
(_ray_run_gsea_for_e_module pid=959428)   norm_tag = 1.0 / sum_correl_tag
(_ray_run_gsea_for_e_module pid=959428) /home/roger/anaconda3/envs/scenicplus/lib/python3.8/site-packages/gseapy/algorithm.py:91: RuntimeWarning: invalid value encountered in multiply
(_ray_run_gsea_for_e_module pid=959428)   tag_indicator * correl_vector * norm_tag - no_tag_indicator * norm_no_tag,
(raylet) Spilled 5732 MiB, 13998 objects, write throughput 2560 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
```

If anybody has encountered the same issue or could help me, that would be great.

Thank you.
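The two RuntimeWarnings point at the line `norm_tag = 1.0 / sum_correl_tag`, which is consistent with a gene set whose members all have zero weight in the ranking. The following is a minimal pure-Python sketch of a weighted running-sum enrichment score with that case guarded. It is not gseapy's code, the function and variable names are hypothetical, and whether this case is related to the stall is not established here.

```python
def enrichment_score(weights, in_set):
    """Weighted running-sum enrichment score over a ranked gene list.

    weights: non-negative ranking weights, in rank order.
    in_set:  booleans, True where the ranked gene belongs to the gene set.
    Returns None when the score is undefined.
    """
    hit_weight_sum = sum(w for w, hit in zip(weights, in_set) if hit)
    n_miss = sum(1 for hit in in_set if not hit)
    if hit_weight_sum == 0 or n_miss == 0:
        # This is the case where 1.0 / sum would divide by zero.
        return None
    running, best = 0.0, 0.0
    for w, hit in zip(weights, in_set):
        running += w / hit_weight_sum if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

ok = enrichment_score([3.0, 2.0, 1.0, 0.5], [True, False, True, False])
degenerate = enrichment_score([3.0, 2.0, 0.0, 0.0], [False, False, True, True])
print(ok, degenerate)  # 0.75 None
```

In the second call every gene in the set has weight 0, so the normalising sum is 0. An unguarded array version would produce inf and then NaN at that point, matching the "divide by zero" and "invalid value" messages.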

SeppeDeWinter commented 10 months ago

Hi @rogercasalsfr

I would also suggest using the development version of the code. See https://github.com/aertslab/scenicplus/discussions/202 for more info.

All the best,

Seppe

aertslab / scenicplus

Stuck at GSEA Step #148