PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

FALCON runs, Unzip doesn't, but I can't find the error #693

Closed: michaeljmetzger closed this issue 4 years ago

michaeljmetzger commented 4 years ago

I have run the FALCON assembler and it works successfully (using 0.0.6 and python3), but when I try the FALCON-Unzip stage the program keeps stopping somewhere in 3-unzip. Sometimes only the "reads" folder is generated, but other times the "reads" folder and the "0-phasing" folder are created (with the exact same conda environment and the same commands run from the same working directory). It stops at a slightly different place each time, but I never see anything I can identify as an error.

The furthest it has gotten was into the 0-phasing section. This is the stderr file from that run. Can anyone identify any problems? At one point near the end it reads: CRITICAL - Error in /opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py with args="{'json_fn': '/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split/task.json',\n 'timeout': 30,\n 'tmpdir': None}"

Could this be the problem? The file it refers to (task.json) exists at that path, so it is not a missing file. I am pasting the whole stderr file below.

Thanks to anyone willing to take a look.

+ python3 -m pwatcher.mains.fs_heartbeat --directory=/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split --heartbeat-file=/home/mmetzger/MELC-2E11/mypwatcher/heartbeats/heartbeat-P768dfd9736eac6 --exit-file=/home/mmetzger/MELC-2E11/mypwatcher/exits/exit-P768dfd9736eac6 --rate=10.0 /bin/bash run.sh
export PATH=$PATH:/bin
+ export PATH=/opt/conda/miniconda2/envs/denovo_asm/bin:/opt/conda/miniconda2/bin:/opt/conda/miniconda2/bin:/opt/conda/miniconda2/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/biotools/supernova-2.1.1:/bin
+ PATH=/opt/conda/miniconda2/envs/denovo_asm/bin:/opt/conda/miniconda2/bin:/opt/conda/miniconda2/bin:/opt/conda/miniconda2/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/biotools/supernova-2.1.1:/bin
cd /home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split
+ cd /home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split
/bin/bash task.sh
+ /bin/bash task.sh
pypeflow 2.3.0
2019-10-09 14:46:36,067 - root - DEBUG - Running "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py /home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split/task.json"
2019-10-09 14:46:36,069 - root - DEBUG - Checking existence of '/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split/task.json' with timeout=30
2019-10-09 14:46:36,069 - root - DEBUG - Loading JSON from '/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split/task.json'
2019-10-09 14:46:36,069 - root - DEBUG - {'bash_template_fn': 'template.sh',
 'inputs': {'ctg_list': '../../reads/ctg_list',
            'pread_ids': '../../reads/dump_pread_ids/pread_ids',
            'pread_to_contigs': '../../reads/pread_to_contigs',
            'rawread_ids': '../../reads/dump_rawread_ids/rawread_ids'},
 'outputs': {'bash_template': 'bash-template.sh',
             'split': 'all-units-of-work.json'},
 'parameters': {'pypeflow_mb': 4000, 'pypeflow_nproc': 1, 'topdir': '../../..'}}
2019-10-09 14:46:36,069 - root - INFO - CD: '/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split' <- '/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split'
2019-10-09 14:46:36,069 - root - DEBUG - Checking existence of '../../reads/ctg_list' with timeout=30
2019-10-09 14:47:06,102 - root - ERROR - Was waiting for '../../reads/ctg_list'
Traceback (most recent call last):
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py", line 69, in wait_for
    _wait_for(fn, timeout)
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py", line 85, in _wait_for
    raise Exception('Timed out waiting for {!r}'.format(fn))
Exception: Timed out waiting for '../../reads/ctg_list'
2019-10-09 14:47:06,103 - root - INFO - CD: '/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split' -> '/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split'
2019-10-09 14:47:06,103 - root - CRITICAL - Error in /opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py with args="{'json_fn': '/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split/task.json',\n 'timeout': 30,\n 'tmpdir': None}"
Traceback (most recent call last):
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py", line 280, in <module>
    main()
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py", line 272, in main
    run(**vars(parsed_args))
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py", line 266, in run
    run_cfg_in_tmpdir(cfg, tmpdir, '.')
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py", line 219, in run_cfg_in_tmpdir
    wait_for(fn)
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py", line 69, in wait_for
    _wait_for(fn, timeout)
  File "/opt/conda/miniconda2/envs/denovo_asm/lib/python3.7/site-packages/pypeflow/do_task.py", line 85, in _wait_for
    raise Exception('Timed out waiting for {!r}'.format(fn))
Exception: Timed out waiting for '../../reads/ctg_list'
pwd
+++ pwd
++ echo 'FAILURE. Running top in /home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split (If you see -terminal database is inaccessible- you are using the python bin-wrapper, so you will not get diagnostic info. No big deal. This process is crashing anyway.)'
++ rm -f top.txt
++ which python
++ which top
++ env -u LD_LIBRARY_PATH top -b -n 1
++ env -u LD_LIBRARY_PATH top -b -n 1
++ pstree -apl

real    0m30.507s
user    0m0.289s
sys 0m0.087s
Namespace(command=['/bin/bash', 'run.sh'], directory='/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split', exit_file='/home/mmetzger/MELC-2E11/mypwatcher/exits/exit-P768dfd9736eac6', heartbeat_file='/home/mmetzger/MELC-2E11/mypwatcher/heartbeats/heartbeat-P768dfd9736eac6', rate=10.0)

cwd:'/home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split'
hostname=Mir
heartbeat_fn='/home/mmetzger/MELC-2E11/mypwatcher/heartbeats/heartbeat-P768dfd9736eac6'
exit_fn='/home/mmetzger/MELC-2E11/mypwatcher/exits/exit-P768dfd9736eac6'
sleep_s=10.0
before setpgid: pid=10012 pgid=9957
 after setpgid: pid=10012 pgid=10012
In cwd: /home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split, Blocking call: '/bin/bash run.sh'
 returned: 256
pb-cdunn commented 4 years ago

You might get more info from the stderr file in /home/mmetzger/MELC-2E11/3-unzip/0-phasing/phasing-split. But basically, here is the problem:

Timed out waiting for '../../reads/ctg_list'

That should be under the 3-unzip directory, so look in 3-unzip/reads and see what happened.
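A quick way to check is to list the expected inputs directly; a minimal sketch, with the paths taken from the task.json dump in the log above:

cd /home/mmetzger/MELC-2E11   # the top directory of this run
ls -l 3-unzip/reads/ctg_list \
      3-unzip/reads/pread_to_contigs \
      3-unzip/reads/dump_pread_ids/pread_ids \
      3-unzip/reads/dump_rawread_ids/rawread_ids

If any of these are missing, the upstream reads step did not finish; if they exist now, they may simply not have been visible yet when the 30-second timeout in do_task.py expired.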

Maybe your environment is out-of-date? Try conda update --all.

If you could help us find the cause, we'd update the code quickly. We have not seen this ourselves.

michaeljmetzger commented 4 years ago

I restarted this run two days ago using pwatcher_type = blocking instead of fs_based, and surprisingly it has now made it to the 4-polish stage. Perhaps this makes sense given the "timed out" error that you found. The ctg_list file was there, in the right directory, in the run that had failed, so maybe the fs_based pwatcher was too stringent.
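For context, a minimal sketch of what that switch looks like in the cfg's [job.defaults] section. The submit template here is assumed from the pb-assembly documentation's local-run example, not taken from this run, and the heredoc is only for illustration; in practice you would edit the existing section of the cfg:

cat >> fc_unzip.cfg <<'EOF'
[job.defaults]
pwatcher_type = blocking
# assumed local-run submit template; adapt job_type/submit for a scheduler
submit = /bin/bash -c "${JOB_SCRIPT}" > "${JOB_STDOUT}" 2> "${JOB_STDERR}"
EOF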

Thanks for your help! I will post with an update when I have further success or failure.

michaeljmetzger commented 4 years ago

Well, things progressed much further than before but may have stalled again. 3-unzip ran in two days, and 4-polish has now been running for 18 days. It went through the quiver-run files and, about a day ago, appeared to have finished all of them except one. Since then it has essentially stopped and seems to be stuck doing nothing. There is a single python3 command still running, but it is mostly idle, with only 2-3 minutes of run time over the last 12 hours. Based on htop, the last command still running (258M virtual memory, 22040 resident, and essentially zero CPU and memory usage) is:

python3 -m falcon_kit.mains.generic_run_units_of_work --nproc=4 --units-of-work-fn=../../quiver-chunks/000172F_009/some-units-of-work.json --bash-template-fn=../../quiver-split/bash-template.sh --results-fn=results.json

I can't find a stderr output that hasn't closed out, but when I look at the all.log file I see:

2019-10-31 16:43:44,022 - pypeflow.simple_pwatcher_bridge:325 - DEBUG - N in queue: 1 (max_jobs=10)
2019-10-31 16:43:44,022 - pwatcher.blocking:462 - DEBUG - query(which='list', jobids=<1>)

I am showing only the last repeat here, but I have tailed the last 20,000 lines and it is all repeats of this pair, about six times per second.

Does anyone know how to find out whether polish has stalled? Has anyone else had this problem? Is it possible it will resolve itself, or am I watching a dead loop?
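One hedged way to check whether anything is still making progress (the module name comes from the command quoted above; the rest is generic Linux, not FALCON-specific):

pid=$(pgrep -f 'falcon_kit.mains.generic_run_units_of_work' | head -n 1)
pstree -apl "$pid"                               # a live gcpp child means consensus is still running
grep -E '^(State|Threads)' "/proc/$pid/status"   # "State: S (sleeping)" with no children is a bad sign
# Also worth watching: modification times of files in the quiver-run task
# directory that is still open; if nothing changes for hours, that unit of
# work has probably stalled.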

michaeljmetzger commented 4 years ago

I have found that the stderr file in the quiver-run/000172_9 folder hasn't completed like the rest of the quiver-run folders. The end of that stderr file has:

set +e

I do not see the gcpp process anymore, so my guess is that somehow one of the thousands of gcpp processes over the last three weeks died and I may have to redo the whole thing and hope it doesn't happen again.

Does anyone have any better suggestions? Is there a way for me to manually restart this command?

UPDATE: I found the process:

gcpp --algorithm=arrow -x 5 -X 120 -q 0 -j 1 -r ../../../quiver-split/refs/000172F_009/ref.fa aln-000172F_009.bam -o cns.fasta,cns.fastq,cns.vcf

But it is asleep (and the .temp versions of the three output files have not changed in two weeks). I am not sure why, or how to wake it up again.
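For anyone who hits the same thing, a hedged sketch of a manual retry; the paths mirror the ones quoted above and will differ between runs, and this is not an official recovery procedure:

# Stop the sleeping consensus process so the workflow sees a clean failure.
pkill -f 'gcpp .*000172F_009'
# Re-run the stalled unit of work in place, using the script the workflow
# generated for it (each task directory contains a run.sh that wraps task.sh,
# as in the phasing-split log above).
cd 4-polish/quiver-run/000172F_009   # adjust to the task directory whose stderr is still open
bash run.sh
# Afterwards, restarting fc_unzip.py with the same config should skip tasks
# whose outputs already exist and recompute only the missing ones.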

pb-cdunn commented 4 years ago

This might be a gcpp problem rather than an assembly problem. Would you mind submitting a new Issue in https://github.com/PacificBiosciences/pbbioconda? That will get rid of the unrelated problems above, and maybe someone on the broader team can help you.

michaeljmetzger commented 4 years ago

So I will open the gcpp problem as a new issue, but I should report that switching to pwatcher_type = blocking seems to have solved the first problem of random failures. That might be useful for others to know.

pb-cdunn commented 4 years ago

Yes, that's good to know, Michael. We're trying to discourage the other.

michaeljmetzger commented 4 years ago

When you say you are trying to discourage the other pwatcher_type, do you know where I could find any updated recommendations like this? All of the current documentation I have found says that pwatcher_type = fs_based is the default. https://github.com/PacificBiosciences/pb-assembly#polish

pb-cdunn commented 4 years ago

That link says fs_based is the default, but that's only for historical reasons. It does describe how to use blocking instead.

I plan to drop support for fs_based because I cannot keep up with all the questions. Every system is different. With blocking, you are allowed (actually forced) to configure it yourself.
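For reference, a hedged sketch of what configuring the blocking pwatcher yourself might look like for an SGE cluster, modeled loosely on the pb-assembly README example; the queue name, memory, and job counts are placeholders, and the heredoc is only for illustration (in practice you would edit the existing [job.defaults] section of your cfg and adapt the submit line to your scheduler):

cat >> fc_unzip.cfg <<'EOF'
[job.defaults]
pwatcher_type = blocking
job_type = sge
JOB_QUEUE = myqueue
NPROC = 4
MB = 32768
njobs = 32
submit = qsub -S /bin/bash -sync y -V -q ${JOB_QUEUE} -N ${JOB_NAME} -o "${JOB_STDOUT}" -e "${JOB_STDERR}" -pe smp ${NPROC} "${JOB_SCRIPT}"
EOF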

michaeljmetzger commented 4 years ago

Thanks! That is helpful to know. Do you know who I could talk to who might be able to update the pb-assembly documentation page to include this info? I am probably not the only person learning how to use FALCON with only that website as documentation, and I follow the defaults unless I see something recommending otherwise. Alternatively, if you have your recommendations written up in some other documentation that I have missed, that would be really helpful. Thanks again!

pb-cdunn commented 4 years ago

You should now be able to add a Wiki page at https://github.com/PacificBiosciences/pb-assembly/wiki . You could help us and others a lot there. We plan to move the README docs to Wiki eventually, but we haven't gotten to that.

cement-head commented 3 years ago

Hey @pb-cdunn @gconcepcion, is the F1 Bull tutorial dataset broken with the latest version of FALCON-Unzip (conda, updated April 16th, 2021)? I got FALCON to run without issues, and FALCON-Unzip worked until it crashed at the polish/quiver stage. Digging around GitHub (here) and looking at the stderr file, I get this message:

set +e
+ set +e
gcpp --algorithm=arrow -x 5 -X 120 -q 0 -j $nproc -r ../../../quiver-split/refs/000000F/ref.fasta aln-000000F.bam            -o ${fasta_fn},${fastq_fn},cns.vcf
+ gcpp --algorithm=arrow -x 5 -X 120 -q 0 -j 1 -r ../../../quiver-split/refs/000000F/ref.fasta aln-000000F.bam -o cns.fasta,cns.fastq,cns.vcf
>|> 20210416 14:38:48.922 -|- FATAL -|- Run -|- 0x2af7ef399300|| -|- gcpp ERROR: [pbbam] chemistry compatibility ERROR: unsupported sequencing chemistry combination:
    binding kit:        100-862-200
    sequencing kit:     100-861-800
    basecaller version: 4.0.0.189308
See SMRT Link release documentation for details about consumables compatibility or contact PacBio Technical Support.

Based on comments made in this issue and other issues on GitHub, am I experiencing an RSII chemistry error with the latest release of FALCON-Unzip?

If so, can you please comment and let me (us) know?
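A hedged way to double-check what chemistry the input data claims: the binding kit, sequencing kit, and basecaller version that gcpp is complaining about are recorded in the @RG DS tag of the PacBio BAM header. This sketch assumes samtools is available, and subreads.bam stands in for whichever BAM was fed to the polish step:

samtools view -H subreads.bam \
    | grep '^@RG' \
    | tr ';' '\n' \
    | grep -E 'BINDINGKIT|SEQUENCINGKIT|BASECALLERVERSION'

Comparing those values against the chemistries the installed pbbam/gcpp supports (or the SMRT Link release notes) should show whether this is a chemistry-support issue rather than a FALCON-Unzip bug.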

Also, it would be extremely helpful to have an updated tutorial dataset made available that is known (tested by PacBio/you) to work with FALCON-Unzip, so that we can test our installations before running our actual novel data. TIA