PengNi / deepsignal3


multiprocessing torch error with call modification #5

Open · sengelen opened 1 week ago

sengelen commented 1 week ago

Dear all,

I tested deepsignal3 call_mods successfully on a sample of 360,000 reads, but when I launched it on the entire run (16M reads) I got this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "/usr/local/bin/deepsignal3", line 33, in <module>
    sys.exit(load_entry_point('deepsignal3==0.1.1', 'console_scripts', 'deepsignal3')())
  File "/usr/local/lib/python3.10/dist-packages/deepsignal3-0.1.1-py3.10.egg/deepsignal3/deepsignal3.py", line 1240, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/deepsignal3-0.1.1-py3.10.egg/deepsignal3/deepsignal3.py", line 43, in main_call_mods
    call_mods(args)
  File "/usr/local/lib/python3.10/dist-packages/deepsignal3-0.1.1-py3.10.egg/deepsignal3/call_modifications.py", line 118, in call_mods
    handle_directory_input(args, input_path, model_path, success_file)
  File "/usr/local/lib/python3.10/dist-packages/deepsignal3-0.1.1-py3.10.egg/deepsignal3/call_modifications.py", line 144, in handle_directory_input
    handle_pod5_input(args, input_path, model_path, success_file, is_dna, is_recursive)
  File "/usr/local/lib/python3.10/dist-packages/deepsignal3-0.1.1-py3.10.egg/deepsignal3/call_modifications.py", line 172, in handle_pod5_input
    mp.spawn(_call_mods_from_pod5_gpu_distributed, args=(world_size, param), nprocs=world_size, join=True)
  File "/usr/local/lib/python3.10/dist-packages/torch-2.5.1-py3.10-linux-x86_64.egg/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/lib/python3.10/dist-packages/torch-2.5.1-py3.10-linux-x86_64.egg/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.10/dist-packages/torch-2.5.1-py3.10-linux-x86_64.egg/torch/multiprocessing/spawn.py", line 184, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated

At first I thought it was due to big reads, so I added a 1 Mb read to my sample of 360,000 reads and everything worked well, so the problem is not the size of the reads. Do you think it is due to the number of reads? On the web they say this error occurs when the size of the pickled data is > 4096; do you have any idea?
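For context, here is a minimal, generic sketch (not deepsignal3 code) of the mechanism in play: torch.multiprocessing.spawn pickles everything in args and sends a copy to each spawned worker, so a large in-memory object is duplicated world_size times. One hedged reading of the log above is that a worker was killed (the SIGKILL, e.g. by the kernel OOM killer) while the pickled arguments were still being transferred, which would leave the remaining workers with a truncated pickle stream. The worker function and param below are purely illustrative.

```python
# Illustrative sketch of torch.multiprocessing.spawn's argument passing:
# everything in `args` is pickled and shipped to each of the `nprocs` workers,
# so a big object here costs `world_size` extra copies in memory.
import torch.multiprocessing as mp


def worker(rank, world_size, param):
    # `rank` is supplied automatically by mp.spawn; `param` arrives as a fresh
    # unpickled copy in every worker process.
    print(f"rank {rank}/{world_size} received {len(param)} items")


if __name__ == "__main__":
    world_size = 4
    # stand-in for per-run data; grow it to stress memory and the pickle transfer
    param = list(range(1_000_000))
    mp.spawn(worker, args=(world_size, param), nprocs=world_size, join=True)
```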

Thank you,
Stefan

xyfltq commented 5 days ago

Hi @sengelen. Can you provide more information? In particular, please check whether memory is exhausted while the program is running. I am currently trying to find a solution to this problem. In the meantime, perhaps you can try splitting the input pod5 and bam into multiple chunks, running the methylation calls on each chunk separately, and finally merging the output tsv files, to see whether that works around the problem.
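A rough sketch of this split-and-merge idea, assuming the BAM can be read with pysam and that the per-chunk call_mods TSVs have no header line (file names, chunk size, and helper names are illustrative; the actual pod5/BAM subsetting would still be done with the pod5 and samtools tooling):

```python
# Hedged sketch: collect read IDs from the BAM, write them out in chunks that can
# be used to subset the pod5/BAM pair, and concatenate the per-chunk output TSVs.
import pysam


def write_read_id_chunks(bam_path, chunk_size=1_000_000, prefix="chunk"):
    ids = set()
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for aln in bam.fetch(until_eof=True):  # no index needed with until_eof
            ids.add(aln.query_name)
    chunk_files = []
    ids = sorted(ids)
    for i in range(0, len(ids), chunk_size):
        path = f"{prefix}_{i // chunk_size:03d}.txt"
        with open(path, "w") as out:
            out.write("\n".join(ids[i:i + chunk_size]) + "\n")
        chunk_files.append(path)
    return chunk_files


def merge_tsvs(tsv_paths, merged_path="call_mods.merged.tsv"):
    # assumes the per-chunk TSVs are header-less; if they carry a header,
    # skip it for all files after the first one
    with open(merged_path, "w") as out:
        for path in tsv_paths:
            with open(path) as fh:
                out.writelines(fh)
```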

sengelen commented 5 days ago

Hi @xyfltq. I monitored the memory usage: it was high, but there was no memory overflow. I ran deepsignal3 in a container on a node with 4 A100 GPUs and 80 GB of memory. And yes, in my previous workflow (guppy/tombo/deepsignal-plant) I split the data by scaffolds of the reference, so I can do the same thing here to see whether the problem comes from a specific part of my data or really from the number of reads.
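A sketch of that per-scaffold split, assuming a coordinate-sorted, indexed BAM and samtools available on PATH (scaffold names are taken from the BAM header; matching pod5 subsets would still need to be produced separately, and the output file names are naive):

```python
# Hedged sketch: write one BAM per scaffold so each piece (with its pod5 subset)
# can be run through deepsignal3 call_mods independently.
import subprocess
import pysam


def split_bam_by_scaffold(bam_path):
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        scaffolds = list(bam.references)  # reference names from the BAM header
    outputs = []
    for name in scaffolds:
        out_path = f"{name}.bam"
        # samtools view -b -o <out> <in> <region> keeps only reads mapped to `name`
        subprocess.run(
            ["samtools", "view", "-b", "-o", out_path, bam_path, name],
            check=True,
        )
        outputs.append(out_path)
    return outputs
```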