dahak-metagenomics / dahak

benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License

Kaiju does not output files in taxonomic workflow #55

Closed charlesreid1 closed 6 years ago

charlesreid1 commented 6 years ago

I am currently working through the taxonomic workflow as part of #45 (see this fork of dahak for a better-formatted version of the taxonomic classification workflow, and the scripts folder of the dahak-yeti repository for scripts to run dahak on AWS nodes), and have made it nearly all the way through. However, I am experiencing an issue with the Kaiju container: it is supposed to output a file used by the next step in the workflow, but no file is being produced.

Command being run:

    docker run -v ${PWD}:/data quay.io/biocontainers/kaiju:1.5.0--pl5.22.0_0 \
        kaiju \
        -x \
        -v \
        -t /data/kaijudb/nodes.dmp \
        -f /data/kaijudb/kaiju_db_nr_euk.fmi \
        -i /data/${base}_1.trim2.fq.gz \
        -j /data/${base}_2.trim2.fq.gz \
        -o /data/${base}.kaiju_output.trim2.out \
        -z 4

where ${base} is something like SR606249.
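For context, the command above is run once per sample. A dry-run sketch of the surrounding loop (the SAMPLES list is an illustrative assumption, and the commands are only printed, never executed):

```shell
# Dry-run sketch: print the per-sample Kaiju docker command without running it.
# SAMPLES is an illustrative assumption; the real workflow defines its own list.
SAMPLES="SR606249"
KAIJU_IMG="quay.io/biocontainers/kaiju:1.5.0--pl5.22.0_0"

for base in $SAMPLES; do
    cmd="docker run -v ${PWD}:/data ${KAIJU_IMG} kaiju -x -v \
-t /data/kaijudb/nodes.dmp -f /data/kaijudb/kaiju_db_nr_euk.fmi \
-i /data/${base}_1.trim2.fq.gz -j /data/${base}_2.trim2.fq.gz \
-o /data/${base}.kaiju_output.trim2.out -z 4"
    echo "$cmd"
done
```

Replacing echo "$cmd" with eval "$cmd" (or calling docker directly) would actually run each step.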

Expected behavior

-o flag indicates this should output a file:

        -o /data/${base}.kaiju_output.trim2.out

Actual behavior

No file is output by the process. I see the following messages printed by the container while it is running:

11:36:49 Reading database
 Reading taxonomic tree from file /data/kaijudb/nodes.dmp
 Reading index from file data/kaijudb/kaiju_db_nr_euk.fmi

11:37:03 Reading database
 Reading taxonomic tree from file /data/kaijudb/nodes.dmp
 Reading index from file data/kaijudb/kaiju_db_nr_euk.fmi

The input files nodes.dmp and kaiju_db_nr_euk.fmi are both present in the container, but no output file is created.

To debug, I changed the docker run line above to:

    docker run -it --rm -v ${PWD}:/data quay.io/biocontainers/kaiju:1.5.0--pl5.22.0_0 \
        /bin/bash

This gives an interactive prompt inside the container. From there I verified the input files were mounted correctly, and I ran the Kaiju command:

    kaiju \
        -x \
        -v \
        -t /data/kaijudb/nodes.dmp \
        -f /data/kaijudb/kaiju_db_nr_euk.fmi \
        -i /data/${base}_1.trim2.fq.gz \
        -j /data/${base}_2.trim2.fq.gz \
        -o /data/${base}.kaiju_output.trim2.out \
        -z 4

When I did this, I saw a segmentation fault.
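For debugging crashes like this, the shell exit status encodes the fatal signal (128 + signal number), which helps distinguish a genuine segfault from the kernel's OOM killer. A small helper of my own (not part of dahak or Kaiju):

```shell
# Interpret a process exit status: 128+N means death by signal N.
# 11 = SIGSEGV (segfault); 9 = SIGKILL, which on Linux often means the
# kernel OOM killer terminated the process (check dmesg to confirm).
classify_exit() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        case $sig in
            11) echo "segfault (SIGSEGV)" ;;
            9)  echo "killed (SIGKILL, possibly the OOM killer)" ;;
            *)  echo "died from signal $sig" ;;
        esac
    else
        echo "exited with status $status"
    fi
}
```

Right after running kaiju, calling classify_exit $? reports whether it crashed, was killed, or exited normally.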

Steps to reproduce the behavior

This is difficult to reproduce, because it requires generating and downloading very large files.

See the workflow steps.

Possible Resolution

I suspect the problem may be with the Kaiju version being used: this points to an old quay.io biocontainer version of Kaiju, quay.io/biocontainers/kaiju:1.5.0--pl5.22.0_0. The most recent version of Kaiju (1.6.1) was recently added to biocontainers via bioconda/bioconda-recipes#7213 so we should probably leverage that somehow.

pmenzel commented 6 years ago

Hi, yes, you need to upgrade the Kaiju version: gzip-compressed files can only be read directly by Kaiju from version 1.6.0 onward. Maybe that caused the segfault.
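For anyone stuck on a pre-1.6.0 container, one workaround (a sketch of my own, assuming the file naming used above) is to decompress the read pair first and pass the plain FASTQ files to kaiju via -i/-j:

```shell
# Workaround sketch for Kaiju < 1.6.0, which cannot read gzip input directly:
# decompress the trimmed read pair first (gunzip -k keeps the .gz originals).
decompress_pair() {
    base=$1
    gunzip -k -f "${base}_1.trim2.fq.gz" "${base}_2.trim2.fq.gz"
    # kaiju would then be run with: -i ${base}_1.trim2.fq -j ${base}_2.trim2.fq
}
```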

Or it could be too little RAM: I would recommend 60 GB of RAM when using the NR+euk database kaiju_db_nr_euk.fmi.
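Since the whole .fmi index is loaded into memory, a quick pre-flight comparison of RAM against the database size can catch this before launching the container. A rough Linux-only sketch (the ~20% headroom factor is my own guess, not a Kaiju figure):

```shell
# Pre-flight sketch (Linux): succeed only if RAM exceeds the Kaiju .fmi index
# size plus ~20% headroom (a rough guess, not a Kaiju requirement).
# An optional second argument overrides the detected RAM (in KB), for testing.
enough_ram_for_db() {
    db=$1
    db_kb=$(( $(stat -c %s "$db") / 1024 ))
    mem_kb=${2:-$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)}
    [ "$mem_kb" -gt $(( db_kb + db_kb / 5 )) ]
}
```

For example: enough_ram_for_db kaijudb/kaiju_db_nr_euk.fmi || echo "probably not enough RAM for this index"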

charlesreid1 commented 6 years ago

Thanks for the input! It sounds like both things were issues, as I only had 32 GB of RAM on the machine I was using.

charlesreid1 commented 6 years ago

Problem was resolved by switching the docker run command to use Kaiju 1.6.1 and running on a node with 64 GB of RAM (instead of 32 GB):

    docker run -v ${PWD}:/data quay.io/biocontainers/kaiju:1.6.1--pl5.22.0_0 \
        kaiju \
        -x \
        -v \
        -t /data/kaijudb/nodes.dmp \
        -f /data/kaijudb/kaiju_db_nr_euk.fmi \
        -i /data/${base}_1.trim2.fq.gz \
        -j /data/${base}_2.trim2.fq.gz \
        -o /data/${base}.kaiju_output.trim2.out \
        -z 4
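To keep a silently-missing output file from propagating into later workflow steps again, a small defensive wrapper can assert on the expected file right after each step (my own sketch; the name require_output is not part of dahak):

```shell
# Run a command, then fail loudly if the expected output file is missing or
# empty, instead of letting the workflow continue silently.
require_output() {
    expected=$1
    shift
    "$@"
    if [ ! -s "$expected" ]; then
        echo "ERROR: expected output '$expected' is missing or empty" >&2
        return 1
    fi
}
```

For example: require_output ${base}.kaiju_output.trim2.out docker run ... (followed by the full docker command above).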

Thanks again @pmenzel!

pmenzel commented 6 years ago

Hi, good that this problem was solved. Could I bother you with another macOS problem? ;)

There is a problem in makeDB.sh, where I use the option -i for xargs, which does not work on macOS. It is also deprecated on Linux, so I want to replace it. It affects two lines in makeDB.sh.

I would like to replace the xargs part of these lines with:

    xargs -n 1 -P $parallelConversions -IXX gbk2faa.pl XX XX.faa

It works on Linux, but it needs testing on macOS too. Could you please quickly try it? Just run the progenomes option, makeDB.sh -p -v, with the modified lines 238 and 256 in makeDB.sh.
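As a quick portability check of the proposed replacement, the same -I substitution can be exercised with echo standing in for gbk2faa.pl (which is specific to the Kaiju build); both GNU and BSD/macOS xargs accept -I:

```shell
# Demonstrate the -I replacement pattern: each input line is substituted for
# XX wherever it appears in the command. echo stands in for gbk2faa.pl here.
# Note: -I already implies one input line per invocation, so -n 1 is redundant
# (GNU xargs even treats -n and -I as mutually exclusive and lets -I win).
printf 'fileA\nfileB\n' | xargs -P 2 -IXX echo "XX -> XX.faa"
```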

That would be great, Peter

charlesreid1 commented 6 years ago

Yes, happy to help. Continued in the thread for issue #61.