chapmanb / cloudbiolinux

CloudBioLinux: configure virtual (or real) machines with tools for biological analyses
http://cloudbiolinux.org
MIT License

a peculiar bug in grch37 dbnsfp recipe #296

Closed naumenko-sa closed 5 years ago

naumenko-sa commented 5 years ago

Hello, cloudbiolinux community!

When installing bcbio_nextgen 1.1.5 from scratch, the `--datatarget dbnsfp` step failed for me:

```
Running GGD recipe: GRCh37 dbnsfp 3.5a
Traceback (most recent call last):
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/anaconda/bin/bcbio_nextgen.py", line 221, in <module>
    install.upgrade_bcbio(kwargs["args"])
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 106, in upgrade_bcbio
    upgrade_bcbio_data(args, REMOTES)
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/anaconda/lib/python3.6/site-packages/bcbio/install.py", line 348, in upgrade_bcbio_data
    args.cores, ["ggd", "s3", "raw"])
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 354, in install_data_local
    _prep_genomes(env, genomes, genome_indexes, ready_approaches, data_filedir)
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 480, in _prep_genomes
    retrieve_fn(env, manager, gid, idx)
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/genomes.py", line 850, in _install_with_ggd
    ggd.install_recipe(os.getcwd(), env.system_install, recipe_file, gid)
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 30, in install_recipe
    recipe["recipe"]["full"]["recipe_type"], system_install)
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/tmpbcbio-install/cloudbiolinux/cloudbio/biodata/ggd.py", line 62, in _run_recipe
    subprocess.check_output(["bash", run_file])
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/anaconda/lib/python3.6/subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/anaconda/lib/python3.6/subprocess.py", line 418, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['bash', '/hpf/largeprojects/ccmbio/naumenko/tools/bcbio_1.1.5/genomes/Hsapiens/GRCh37/txtmp/ggd-run.sh']' returned non-zero exit status 141.
```

I was able to track the problem down to this line in the recipe:

```shell
echo "test0"
unzip -p dbNSFPv*.zip "dbNSFP*_variant.chr1" | head -n1 > $UNPACK_DIR/header.txt
echo "test1"
```

header.txt is created, but then the script fails without any message; test1 is never printed.

```shell
echo "test0"
unzip dbNSFPv*.zip "dbNSFP*_variant.chrM"
head -n1 dbNSFP*_variant.chrM > $UNPACK_DIR/header.txt
echo "test1"
```

works well.
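For reference, exit status 141 is 128 + 13, where 13 is SIGPIPE: `head -n1` exits after reading one line, and if `unzip` is still writing to the pipe it is killed by SIGPIPE. Assuming the GGD runner executes recipes with `set -e -o pipefail` (I have not verified the exact flags in ggd-run.sh), that alone is enough to abort the script silently. A minimal sketch of the mechanism, with `seq` standing in for `unzip -p`:

```shell
#!/usr/bin/env bash
# Sketch of the suspected failure mode; `seq` stands in for `unzip -p`.
set -o pipefail

status=0
# seq writes far more than the pipe buffer holds, so it blocks; head exits
# after one line, seq is killed by SIGPIPE, and with pipefail the pipeline
# reports 128 + 13 = 141.
seq 1 1000000 | head -n1 > /dev/null || status=$?
echo "pipeline exit status: $status"
```

Without `pipefail`, the pipeline's status would be `head`'s (0), which would match the observation that the same pattern works fine in an interactive shell.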

I've changed the recipes for GRCh37 and hg38 accordingly here: https://github.com/chapmanb/cloudbiolinux/pull/295

Sergey

chapmanb commented 5 years ago

Sergey, thanks for the diagnosis and the fix. I'm confused as to why that step would fail when run in a pipe but work cleanly when done separately. Do you have an idea of what exactly is failing on your system? The only thing I'm concerned about in your change is the impact of doing a larger unzip on runtime and filesystem size. That particular line was trying to avoid unpacking much of the file and just grab the single header line we need. If the workaround doesn't create anything large on disk and also runs quickly, it works great for me as well. Thanks again.

naumenko-sa commented 5 years ago

Thanks Brad!

It is a puzzle why it is not working; my system is pretty standard. Perhaps something with a pipe buffer overflow (the pipe buffer was increased in recent bash versions), while head -n1 needs only a little data, i.e. a pipe synchronization issue? The fix does not unpack the huge file: note that I changed chr1 to chrM, which is quite small. Interestingly, the larger pipe further down the script, which processes all the huge files, works fine (but it reads every line of the files, not just head -n1).
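If keeping the original streaming one-liner were ever preferable, a hypothetical alternative (untested against the real dbNSFP archive; `seq` again stands in for `unzip -p`, and `$UNPACK_DIR` is the variable from the snippet above) would be to tolerate the expected SIGPIPE explicitly so `pipefail` does not abort the recipe:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: keep the producer | head pipeline, but swallow the
# expected SIGPIPE so `set -e -o pipefail` does not kill the script.
set -e -o pipefail

# The real-recipe shape would be:
#   unzip -p dbNSFPv*.zip "dbNSFP*_variant.chr1" | head -n1 > $UNPACK_DIR/header.txt || true
seq 1 1000000 | head -n1 > header.txt || true
echo "first line: $(cat header.txt)"
```

The trade-off is that `|| true` also masks genuine failures (e.g. a missing or corrupt zip), so the chrM extraction in the merged fix is arguably the more robust choice.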

S.

naumenko-sa commented 5 years ago

Found a typo that I introduced in the dbnsfp recipe. Sorry about that. Please merge: https://github.com/chapmanb/cloudbiolinux/pull/297