PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

error "failed multiprocessing" at the step rawreads__tan_apply #720

Closed fengli-eGen closed 2 years ago

fengli-eGen commented 2 years ago

Hi, I'm running FALCON through SMRT Link's pbcromwell HGAP4 workflow on AWS, on an instance with 64 vCPUs and 128 GB RAM.

The genome I'm assembling is ~2.8 Gb, and the PacBio CLR reads cover it at ~170X. SMRT Link v10.2 was installed and used.

The run gets stuck at the FALCON daligner stage when it tries to run multiprocessing. It generated ~100 shard-XX folders at the call-task__0_rawreads__tan_apply step, for example /data2/falcon_test/cromwell_out/cromwell-executions/pb_hgap4/c85def77-cd12-450e-9062-3238e03d1c6c/call-falcon/falcon/e47b1c37-3cd2-4f4e-abe6-71a4140ca1f4/call-task__0_rawreads__tan_apply/shard-XX. They all show similar errors in stderr; below are the stderr, stdout, and script from shard-21/execution as an example.

+ python3 -m falcon_kit.mains.cromwell_run_uows_tar --nproc=4 --nproc-per-uow=4 --uows-tar-fn=/data2/falcon_test/cromwell_out/cromwell-executions/pb_hgap4/c85def77-cd12-450e-9062-3238e03d1c6c/call-falcon/falcon/e47b1c37-3cd2-4f4e-abe6-71a4140ca1f4/call-task__0_rawreads__tan_apply/shard-21/inputs/-388349761/some-units-of-work.21.tar --tool=datander

falcon-kit 1.8.1 (pip thinks "falcon-kit 1.8.1+git.449fe5cb421c39a39795b4889d6ba47d459dfc9d")
pypeflow 2.3.0+git.03eda6364441793b24845ef5b8d1ef8c58ce1cf4
INFO:root:For multiprocessing, parallel njobs=1 (cpu_count=64, nproc=4, nproc_per_uow=4)
INFO:root:$('tar --strip-components=1 -xvf /data2/falcon_test/cromwell_out/cromwell-executions/pb_hgap4/c85def77-cd12-450e-9062-3238e03d1c6c/call-falcon/falcon/e47b1c37-3cd2-4f4e-abe6-71a4140ca1f4/call-task__0_rawreads__tan_apply/shard-21/inputs/-388349761/some-units-of-work.21.tar')
INFO:root:Started a worker in 81079 from parent 80991
INFO:root:running 3 units-of-work, 1 at a time...
[81079]starting run_uow('./uow-0063')
[81079]maxrss:    21988
INFO:root:CD: './uow-0063' <- '/data2/falcon_test/cromwell_out/cromwell-executions/pb_hgap4/c85def77-cd12-450e-9062-3238e03d1c6c/call-falcon/falcon/e47b1c37-3cd2-4f4e-abe6-71a4140ca1f4/call-task__0_rawreads__tan_apply/shard-21/execution'
INFO:root:$('bash -vex uow.sh')
datander -v -P. raw_reads.253 raw_reads.254 raw_reads.255 raw_reads.256
+ datander -v -P. raw_reads.253 raw_reads.254 raw_reads.255 raw_reads.256
uow.sh: line 1: 81085 Killed                  datander -v -P. raw_reads.253 raw_reads.254 raw_reads.255 raw_reads.256
WARNING:root:Call 'bash -vex uow.sh' returned 35072.
INFO:root:CD: './uow-0063' -> '/data2/falcon_test/cromwell_out/cromwell-executions/pb_hgap4/c85def77-cd12-450e-9062-3238e03d1c6c/call-falcon/falcon/e47b1c37-3cd2-4f4e-abe6-71a4140ca1f4/call-task__0_rawreads__tan_apply/shard-21/execution'
[81079]starting run_uow('./uow-0064')
[81079]maxrss:    22236
INFO:root:CD: './uow-0064' <- '/data2/falcon_test/cromwell_out/cromwell-executions/pb_hgap4/c85def77-cd12-450e-9062-3238e03d1c6c/call-falcon/falcon/e47b1c37-3cd2-4f4e-abe6-71a4140ca1f4/call-task__0_rawreads__tan_apply/shard-21/execution'
INFO:root:$('bash -vex uow.sh')
ERROR:root:failed multiprocessing
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/data2/src/smrtlink/install/smrtlink-release_10.2.0.133434/bundles/smrttools/install/smrttools-release_10.2.0.133434/private/thirdparty/python3/python3_3.9.6/site-packages/falcon_kit/util/io.py", line 68, in run_func
    ret = func(*args)
  File "/data2/src/smrtlink/install/smrtlink-release_10.2.0.133434/bundles/smrttools/install/smrttools-release_10.2.0.133434/private/thirdparty/python3/python3_3.9.6/site-packages/falcon_kit/mains/cromwell_run_uows_tar.py", line 18, in run_uow
    io.syscall(cmd)
  File "/data2/src/smrtlink/install/smrtlink-release_10.2.0.133434/bundles/smrttools/install/smrttools-release_10.2.0.133434/private/thirdparty/python3/python3_3.9.6/site-packages/pypeflow/io.py", line 27, in syscall
    raise Exception(msg)
Exception: Call 'bash -vex uow.sh' returned 35072.
uow.sh: line 1: 76103 Killed                  datander -v -P. raw_reads.233 raw_reads.234 raw_reads.235 raw_reads.236

While troubleshooting, I ran the command for just this shard-21 by itself, and it completed without this error. The command used for this troubleshooting:

python3 -m falcon_kit.mains.cromwell_run_uows_tar --nproc=4 --nproc-per-uow=4 --uows-tar-fn=/data2/falcon_test/cromwell_out/cromwell-executions/pb_hgap4/c85def77-cd12-450e-9062-3238e03d1c6c/call-falcon/falcon/e47b1c37-3cd2-4f4e-abe6-71a4140ca1f4/call-task__0_rawreads__tan_apply/shard-21/inputs/-388349761/some-units-of-work.21.tar --tool=datander

I searched around but could not figure this out, so I wonder if you have any ideas. I checked that multiprocessing is available in the SMRT Link v10.2 python3 (by running "from multiprocessing import Pool" in python3), and it imports fine, but I'm not sure why the pipeline cannot run it.
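For reference, the multiprocessing check was just an import/Pool smoke test along these lines (the pool size and inputs are arbitrary):

# quick smoke test that multiprocessing.Pool can start workers in the SMRT Link python3
python3 -c "from multiprocessing import Pool; print(Pool(2).map(abs, [-1, -2, -3]))"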

Command used to kick off the assembly (since the pipeline got stuck at the FALCON step within HGAP4, I suspect the error is in FALCON). Note that I asked for deeper seed coverage (hgap4_seed_coverage=45) because of the genome size I used here:

$SMRT_ROOT/smrtcmds/bin/pbcromwell run pb_hgap4 \
      -e /data2/raw_reads/subreadset.xml \
      --task-option hgap4_aggressive_asm=False \
      --task-option hgap4_genome_length=2800000000 \
      --task-option hgap4_seed_coverage=45 \
      --task-option hgap4_seed_length_cutoff=-1 \
      --task-option hgap4_falcon_advanced="" \
      --task-option consensus_algorithm="arrow" \
      --task-option dataset_filters="" \
      --task-option downsample_factor=0 \
      --task-option mapping_min_concordance=70.0 \
      --task-option mapping_min_length=50 \
      --task-option mapping_biosample_name="" \
      --task-option mapping_pbmm2_overrides="" \
      --task-option consolidate_aligned_bam=False \
      --config /data2/falcon_test/local.cromwell.conf \
      --nproc 64 1>falcon-assembly-wt.stdout 2>falcon-assembly-wt.stderr

I wonder if you know what might go wrong and why multiprocessing is not working.

Thank you so much!

gconcepcion commented 2 years ago

Hello,

datander was killed, which generally indicates a memory issue:

uow.sh: line 1: 76103 Killed datander -v -P. raw_reads.233 raw_reads.234 raw_reads.235 raw_reads.236

If you can allocate more memory to the process, I would advise doing so.
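If you want to confirm that it was the kernel's OOM killer rather than FALCON itself, checking the node's kernel log right after a failure should show it (a rough sketch, assuming you can read the kernel log on that instance):

# look for OOM-killer activity around the time of the failure
dmesg -T | grep -i -E 'out of memory|killed process'
# watch memory headroom while the tan_apply shards are running
free -h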

Also, 170X CLR is probably much more coverage than necessary; downsampling to <100X might also help with your issue.

See the damasker repo for more information on datander: https://github.com/thegenemyers/DAMASKER

fengli-eGen commented 2 years ago

Thank you for your quick reply! I'll increase the memory. Do you have any recommendations for downsampling to <100X? I'm wondering if I can downsample while keeping the longer reads.

gconcepcion commented 2 years ago

If your reads are in FASTA format, I recommend seqtk to subsample: https://www.biostars.org/p/110107/#110248
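For example (the file names and the 0.5 fraction are placeholders; pick whatever fraction brings you under ~100X):

# keep roughly half of the reads; -s fixes the random seed so the subsample is reproducible
seqtk sample -s100 raw_reads.fasta 0.5 > raw_reads.sub.fasta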

If they are still in BAM format, you can use samtools view with the -s parameter to subsample: http://www.htslib.org/doc/samtools-view.html
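Something along these lines (the seed/fraction and file names are just an example):

# keep roughly half of the reads; in -s 42.5 the integer part is the seed, the decimal part is the fraction
samtools view -b -s 42.5 -o subreads.sub.bam subreads.bam

You would then need a new SubreadSet XML pointing at the downsampled BAM before handing it to pbcromwell.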

You can't subsample and filter reads at the same time, as far as I'm aware. You'll need to get a list of read lengths and work out the fraction you want to retain in order to hit a certain coverage / read-length threshold.
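A rough way to work out that fraction from a BAM (a sketch, assuming the reads are in subreads.bam and using the ~2.8 Gb genome size mentioned above; adjust the genome and target values as needed):

# sum the read lengths in the BAM and print the fraction needed for ~100X coverage
samtools view subreads.bam \
  | awk -v genome=2800000000 -v target=100 \
        '{bases += length($10)} END {printf "fraction to keep: %.3f\n", target * genome / bases}'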

fengli-eGen commented 2 years ago

Hi! Since I'm running FALCON through HGAP4, I found that HGAP4 in SMRT Link provides a downsampling option. As a test, I downsampled heavily so that only 1/6 of the raw data was used (see the command below: --task-option downsample_factor=6). However, even with this much smaller amount of reads, I still got the same multiprocessing-related error, so I suspect the problem is not the large amount of data I used initially.

Here is the cmd I ran:

$SMRT_ROOT/smrtcmds/bin/pbcromwell run pb_hgap4 \
      -e /data2/raw_reads/subreadset.xml \
      --task-option hgap4_aggressive_asm=False \
      --task-option hgap4_genome_length=2000000000 \
      --task-option hgap4_seed_coverage=30 \
      --task-option hgap4_seed_length_cutoff=-1 \
      --task-option hgap4_falcon_advanced="" \
      --task-option consensus_algorithm="arrow" \
      --task-option dataset_filters="" \
      --task-option downsample_factor=6 \
      --task-option mapping_min_concordance=70.0 \
      --task-option mapping_min_length=50 \
      --task-option mapping_biosample_name="" \
      --task-option mapping_pbmm2_overrides="" \
      --task-option consolidate_aligned_bam=False \
      --config /data2/falcon_test/local.cromwell.conf \
      --nproc 64 

Error message in downsample1/cromwell_out/cromwell-executions/pb_hgap4/6ef4ad19-0ce6-4b40-b5f7-85759ca5e73a/call-falcon/falcon/a6452512-46a6-4c9d-9318-84d8ea229e58/call-task__0_rawreads__tan_apply/shard-16/execution/stderr

+ python3 -m falcon_kit.mains.cromwell_run_uows_tar --nproc=4 --nproc-per-uow=4 --uows-tar-fn=/data2/falcon_test/downsample1/cromwell_out/cromwell-executions/pb_hgap4/6ef4ad19-0ce6-4b40-b5f7-85759ca5e73a/call-falcon/falcon/a6452512-46a6-4c9d-9318-84d8ea229e58/call-task__0_rawreads__tan_apply/shard-16/inputs/-461551164/some-units-of-work.16.tar --tool=datander
falcon-kit 1.8.1 (pip thinks "falcon-kit 1.8.1+git.449fe5cb421c39a39795b4889d6ba47d459dfc9d")
pypeflow 2.3.0+git.03eda6364441793b24845ef5b8d1ef8c58ce1cf4
INFO:root:For multiprocessing, parallel njobs=1 (cpu_count=64, nproc=4, nproc_per_uow=4)
INFO:root:$('tar --strip-components=1 -xvf /data2/falcon_test/downsample1/cromwell_out/cromwell-executions/pb_hgap4/6ef4ad19-0ce6-4b40-b5f7-85759ca5e73a/call-falcon/falcon/a6452512-46a6-4c9d-9318-84d8ea229e58/call-task__0_rawreads__tan_apply/shard-16/inputs/-461551164/some-units-of-work.16.tar')
INFO:root:Started a worker in 55894 from parent 55879
INFO:root:running 1 units-of-work, 1 at a time...
[55894]starting run_uow('./uow-0016')
[55894]maxrss:    22012
INFO:root:CD: './uow-0016' <- '/data2/falcon_test/downsample1/cromwell_out/cromwell-executions/pb_hgap4/6ef4ad19-0ce6-4b40-b5f7-85759ca5e73a/call-falcon/falcon/a6452512-46a6-4c9d-9318-84d8ea229e58/call-task__0_rawreads__tan_apply/shard-16/execution'
INFO:root:$('bash -vex uow.sh')
datander -v -P. raw_reads.65 raw_reads.66 raw_reads.67 raw_reads.68
+ datander -v -P. raw_reads.65 raw_reads.66 raw_reads.67 raw_reads.68
uow.sh: line 1: 55900 Killed                  datander -v -P. raw_reads.65 raw_reads.66 raw_reads.67 raw_reads.68
WARNING:root:Call 'bash -vex uow.sh' returned 35072.
INFO:root:CD: './uow-0016' -> '/data2/falcon_test/downsample1/cromwell_out/cromwell-executions/pb_hgap4/6ef4ad19-0ce6-4b40-b5f7-85759ca5e73a/call-falcon/falcon/a6452512-46a6-4c9d-9318-84d8ea229e58/call-task__0_rawreads__tan_apply/shard-16/execution'
ERROR:root:failed multiprocessing
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/data2/src/smrtlink/install/smrtlink-release_10.2.0.133434/bundles/smrttools/install/smrttools-release_10.2.0.133434/private/thirdparty/python3/python3_3.9.6/site-packages/falcon_kit/util/io.py", line 68, in run_func
    ret = func(*args)
  File "/data2/src/smrtlink/install/smrtlink-release_10.2.0.133434/bundles/smrttools/install/smrttools-release_10.2.0.133434/private/thirdparty/python3/python3_3.9.6/site-packages/falcon_kit/mains/cromwell_run_uows_tar.py", line 18, in run_uow
    io.syscall(cmd)
  File "/data2/src/smrtlink/install/smrtlink-release_10.2.0.133434/bundles/smrttools/install/smrttools-release_10.2.0.133434/private/thirdparty/python3/python3_3.9.6/site-packages/pypeflow/io.py", line 27, in syscall
    raise Exception(msg)
Exception: Call 'bash -vex uow.sh' returned 35072.

I wonder if the error may be related to the Cromwell configuration file, since I'm using the local backend instead of SGE on AWS. Here is the configuration I used (/data2/falcon_test/local.cromwell.conf), attached as local.cromwell.conf.txt.

Thank you so much for your advice!