aertslab / pySCENIC

pySCENIC is a lightning-fast Python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering), which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0

[BUG] grn step fails for long-running jobs on an HPC using the SLURM job scheduler #485

Open · crotoc opened this issue 1 year ago

crotoc commented 1 year ago

Describe the bug: I am encountering a very weird problem when running pySCENIC on an HPC where jobs are submitted via SLURM. The command line I am using is as follows:

singularity run -e  --bind /home,/nobackup pyscenic.sif pyscenic grn GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.subsample.pro.hg38_500bp_up_100bp_down.convert.loom allTFs_hg38.txt -o GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.subsample.pro.hg38_500bp_up_100bp_down.adj.csv --num_workers 50

If I run it on a local machine, it produces the adj.csv file, but if it's submitted to the HPC, an error is thrown. I have tried many different datasets, and whenever a run takes more than one day, the problem occurs. I think it's a problem related to multiprocessing.

I have to run pySCENIC on a big dataset; even after subsampling, the dataset is still large. Please help check what causes this problem and whether it can be fixed or worked around.

I hope the grn step can save intermediate results so that, if it throws an error, I can resume from the intermediate files. That would help a ton.
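
The grn step itself does not checkpoint, but because GRNBoost2 fits one regression per target gene, the work can be split by target genes and resumed manually. The following is only a rough sketch of that idea, not a pySCENIC feature: the file names, the chunk size, and the loom row attribute Gene are placeholders, and it assumes the expression matrix fits in memory as a pandas DataFrame (cells x genes).

    # Sketch: run arboreto's GRNBoost2 per chunk of target genes, writing one
    # adjacency CSV per chunk so an interrupted run can be resumed.
    import os
    import glob
    import pandas as pd
    import loompy
    from arboreto.algo import grnboost2
    from arboreto.utils import load_tf_names

    LOOM = "expression.loom"       # placeholder for your .loom file
    TF_FILE = "allTFs_hg38.txt"
    OUT_DIR = "adj_chunks"
    CHUNK_SIZE = 2000              # number of target genes per chunk

    os.makedirs(OUT_DIR, exist_ok=True)

    # Load expression as a cells x genes DataFrame; the row attribute holding
    # gene names ("Gene" here) depends on how the loom was built.
    with loompy.connect(LOOM) as ds:
        ex_matrix = pd.DataFrame(ds[:, :].T, columns=ds.ra.Gene)

    tf_names = [tf for tf in load_tf_names(TF_FILE) if tf in ex_matrix.columns]
    targets = list(ex_matrix.columns)

    for i in range(0, len(targets), CHUNK_SIZE):
        chunk = targets[i:i + CHUNK_SIZE]
        out_csv = os.path.join(OUT_DIR, f"adj_{i:07d}.csv")
        if os.path.exists(out_csv):          # resume: skip finished chunks
            continue
        # Keep all TF columns (the regressors) plus this chunk of targets.
        cols = sorted(set(tf_names) | set(chunk))
        adj = grnboost2(ex_matrix[cols], tf_names=tf_names, seed=777, verbose=True)
        # Keep only regressions for targets in this chunk (TFs outside the
        # chunk would otherwise be re-fitted in every chunk).
        adj[adj["target"].isin(chunk)].to_csv(out_csv, index=False)

    # Merge the chunk files into a single adjacency table.
    pd.concat(map(pd.read_csv, sorted(glob.glob(f"{OUT_DIR}/adj_*.csv")))) \
        .to_csv("adjacencies.csv", index=False)

Each grnboost2 call here uses the default client_or_address='local', so it spins up and tears down its own short-lived local dask cluster, which limits how long any one dask scheduler has to stay healthy. The arboreto_with_multiprocessing.py script shipped with pySCENIC is another option: it runs the GRN step without dask at all, which sidesteps this class of scheduler/heartbeat failures.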

  1. Error encountered:

    2023-06-24 21:37:49,386 - pyscenic.cli.pyscenic - INFO - Inferring regulatory networks.
    2023-06-25 02:36:08,769 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:43171 (pid=56712) exceeded 95% memory budget. Restarting...
    2023-06-25 02:36:11,646 - distributed.nanny - WARNING - Restarting worker
    2023-06-25 07:19:02,055 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:43134 (pid=56747) exceeded 95% memory budget. Restarting...
    2023-06-25 07:19:03,801 - distributed.nanny - WARNING - Restarting worker
    2023-06-25 12:01:20,640 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:45275 (pid=56685) exceeded 95% memory budget. Restarting...
    2023-06-25 12:01:23,892 - distributed.nanny - WARNING - Restarting worker
    2023-06-25 16:42:20,931 - distributed.worker_memory - WARNING - Worker tcp://127.0.0.1:33016 (pid=56806) exceeded 95% memory budget. Restarting...
    2023-06-25 16:42:21,414 - distributed.nanny - WARNING - Restarting worker
    2023-06-25 16:42:22,736 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
    Traceback (most recent call last):
    File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
    tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/distributed/worker.py", line 1214, in heartbeat
    response = await retry_operation(
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils_comm.py", line 386, in retry_operation
    return await retry(
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils_comm.py", line 371, in retry
    return await coro()
  File "/opt/venv/lib/python3.10/site-packages/distributed/core.py", line 1163, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/distributed/core.py", line 928, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:47746 remote=tcp://127.0.0.1:41346>: Stream is closed
2023-06-25 16:42:25,947 - distributed.nanny - WARNING - Worker process still alive after 3.199999694824219 seconds, killing
Traceback (most recent call last):
  File "/opt/venv/bin/pyscenic", line 33, in <module>
    sys.exit(load_entry_point('pyscenic==0.12.1+0.gce41b61.dirty', 'console_scripts', 'pyscenic')())
  File "/opt/venv/lib/python3.10/site-packages/pyscenic/cli/pyscenic.py", line 713, in main
    args.func(args)
  File "/opt/venv/lib/python3.10/site-packages/pyscenic/cli/pyscenic.py", line 106, in find_adjacencies_command
    network = method(
  File "/opt/venv/lib/python3.10/site-packages/arboreto/algo.py", line 39, in grnboost2
    return diy(expression_data=expression_data, regressor_type='GBM', regressor_kwargs=SGBM_KWARGS,
  File "/opt/venv/lib/python3.10/site-packages/arboreto/algo.py", line 134, in diy
    return client \
  File "/opt/venv/lib/python3.10/site-packages/distributed/client.py", line 3337, in compute
    result = self.gather(futures)
  File "/opt/venv/lib/python3.10/site-packages/distributed/client.py", line 2291, in gather
    return self.sync(
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils.py", line 339, in sync
    return sync(
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils.py", line 379, in f
    result = yield future
  File "/opt/venv/lib/python3.10/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/opt/venv/lib/python3.10/site-packages/distributed/client.py", line 2154, in _gather
    raise exception.with_traceback(traceback)
distributed.scheduler.KilledWorker: Attempted to run task finalize-be2b205c2cd7979777e3969dc9a3a895 on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://127.0.0.1:33016. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.

Expected behavior: the GRN step runs through smoothly.

ghuls commented 11 months ago

Try with fewer threads or request a node with more memory. "exceeded 95% memory budget. Restarting..." indicates that you ran out of memory and that those subprocesses got killed (and restarted). How much memory does one node in your HPC have?

crotoc commented 11 months ago

> Try with fewer threads or request a node with more memory. "exceeded 95% memory budget. Restarting..." indicates that you ran out of memory and that those subprocesses got killed (and restarted). How much memory does one node in your HPC have?

I requested 60 cores and 150 GB of total memory. Do you think that's not enough?
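
For a rough sense of scale (illustrative arithmetic only; it assumes dask's local cluster splits the memory it sees evenly across workers, and inside a SLURM job dask may detect the node's physical RAM rather than the 150 GB allocation):

    # Illustrative only: per-worker budget if the available memory is split
    # evenly across dask workers (numbers taken from the command above).
    total_mem_gb = 150     # memory requested from SLURM
    num_workers = 50       # value passed to --num_workers

    per_worker_gb = total_mem_gb / num_workers
    print(f"budget per worker:     {per_worker_gb:.2f} GB")          # 3.00 GB
    print(f"95% restart threshold: {0.95 * per_worker_gb:.2f} GB")   # 2.85 GB

So 150 GB in total still leaves each of the 50 workers only about 3 GB, which would be consistent with the "exceeded 95% memory budget" warnings above; fewer workers (or more memory) gives each worker a larger budget.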

tbrunetti commented 9 months ago

I get this error consistently on SLURM as well, but mine is only a heartbeat error, nothing to do with memory or resources. Interestingly, it only happens with some seeds. For example, I have a really large data object and submitted 100 parallel runs (for robustness and reproducibility); roughly 10-20% of those runs throw this error on the exact same dataset with only the seed ID changing, while the other 80% do not. When I re-run a failed seed ID, it usually finishes without the error. I'm not sure why; maybe it's some type of network issue on my end, although it's not quite the same problem @crotoc is experiencing, since no resource error is thrown. My guess is that even though pySCENIC produces all the expected output when this error is thrown, I shouldn't trust it? I have always been re-running just to be safe.

2023-09-16 05:33:54,639 - pyscenic.cli.pyscenic - INFO - Loading expression matrix.

2023-09-16 05:38:21,261 - pyscenic.cli.pyscenic - INFO - Inferring regulatory networks.
2023-09-16 12:01:20,777 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/distributed/worker.py", line 1214, in heartbeat
    response = await retry_operation(
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils_comm.py", line 386, in retry_operation
    return await retry(
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils_comm.py", line 371, in retry
    return await coro()
  File "/opt/venv/lib/python3.10/site-packages/distributed/core.py", line 1163, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/distributed/core.py", line 928, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:44662 remote=tcp://127.0.0.1:33811>: Stream is closed
2023-09-16 12:01:20,781 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.10/site-packages/distributed/worker.py", line 1214, in heartbeat
    response = await retry_operation(
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils_comm.py", line 386, in retry_operation
    return await retry(
  File "/opt/venv/lib/python3.10/site-packages/distributed/utils_comm.py", line 371, in retry
    return await coro()
  File "/opt/venv/lib/python3.10/site-packages/distributed/core.py", line 1163, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/venv/lib/python3.10/site-packages/distributed/core.py", line 928, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/opt/venv/lib/python3.10/site-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:44668 remote=tcp://127.0.0.1:33811>: Stream is closed
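
On the question of whether the output can be trusted when only a heartbeat error is logged: a simple, unofficial sanity check is to compare the adjacency table from a run that logged the error with a clean re-run of the same seed. The sketch below assumes the standard three-column pyscenic grn output (TF, target, importance); the file names are placeholders.

    # Sketch: compare two pyscenic grn outputs for the same data and seed,
    # e.g. one from a run that logged heartbeat errors and one clean re-run.
    import pandas as pd

    a = pd.read_csv("adj_seed42_with_heartbeat_error.csv")
    b = pd.read_csv("adj_seed42_clean_rerun.csv")

    # Do both runs contain the same set of (TF, target) edges?
    edges_a = set(map(tuple, a[["TF", "target"]].values))
    edges_b = set(map(tuple, b[["TF", "target"]].values))
    print("edges only in first run: ", len(edges_a - edges_b))
    print("edges only in second run:", len(edges_b - edges_a))

    # For shared edges, how similar are the importance scores?
    merged = a.merge(b, on=["TF", "target"], suffixes=("_a", "_b"))
    print("importance correlation:", merged["importance_a"].corr(merged["importance_b"]))

If the edge sets match and the importances agree closely, the run that logged only a heartbeat error most likely finished its computation before the connection dropped; large discrepancies would be a reason to keep re-running, as you already do.
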
plijnzaad commented 4 months ago

Having the exact same problem here: also SLURM, NFS disks, 64 GB, 8 cores, ~900 cells.