dahak-metagenomics / dahak

benchmarking and containerization of tools for analysis of complex non-clinical metagenomes.
https://dahak-metagenomics.github.io/dahak
BSD 3-Clause "New" or "Revised" License

problem with kaiju step of taxonomic classification #89

Closed charlesreid1 closed 6 years ago

charlesreid1 commented 6 years ago

Expected behavior

Running kaiju should create a kaiju output file. See the run kaiju step in the walkthrough (also on the snakemake branch of PR #83 here).

Actual behavior

A segmentation fault occurs whether the kaiju command is run through Docker or Singularity (using the commands given below). Tested on an AWS node with the Ubuntu 16.04 Xenial image, using this kaiju version:

'quayurl' : 'quay.io/biocontainers/kaiju',
'version' : '1.6.1--pl5.22.0_0'

Steps to reproduce the behavior

wget -O SRR606249_1.trim30.fq.gz http://osf.io/qtzyk/download

wget -O SRR606249_2.trim30.fq.gz http://osf.io/dumz6/download

curl -O https://s3-us-west-1.amazonaws.com/spacegraphcats.ucdavis.edu/microbe-genbank-sbt-k51-2017.05.09.tar.gz

curl -O https://s3.amazonaws.com/dahak-project-ucdavis/kaiju/kaiju_index_nr_euk.tgz

tar -xzf data/kaiju_index_nr_euk.tgz -C data/ && rm -f data/kaiju_index_nr_euk.tgz

singularity pull kaiju.simg docker://quay.io/biocontainers/kaiju:1.6.1--pl5.22.0_0

singularity exec \
  --home /home/ubuntu/dahak/workflows/taxonomic_classification  \
  kaiju.simg \
  bash -c " set -euo pipefail; \
    kaiju -x -v -t /data/nodes.dmp \
    -f /data/kaiju_db_nr_euk.fmi \
    -i /data/SRR606249_1.trim2.fq.gz \
    -j /data/SRR606249_2.trim2.fq.gz \
    -o /data/SRR606249.kaiju_output.trim2.out -z 4"

(Note that https://s3.amazonaws.com/dahak-project-ucdavis/kaiju/kaiju_index_nr_euk.tgz points to an S3 bucket holding a copy of the Kaiju .tgz file. Downloading from the bucket is faster, and more polite than always downloading from Kaiju's servers.)

(Alternatively, you can use a docker command, following the walkthrough or the dockerSnakefile in the snakemake branch linked above.)

The kaiju program starts, and runs for a minute or two, but always ends with a Segmentation Fault.

This is a bit tricky to debug, given that it depends on so many files, but do you see anything fishy about the kaiju command?

Output from snakemake log:

    Error in rule run_kaiju:
        jobid: 2
        output: data/SRR606249.kaiju_output.trim2.out

RuleException:
CalledProcessError in line 275 of /home/ubuntu/dahak/workflows/taxonomic_classification/Snakefile:
Command 'singularity exec --home /home/ubuntu/dahak/workflows/taxonomic_classification  /home/ubuntu/dahak/workflows/taxonomic_classification/.snakemake/singularity/589ebb255515f9b51548999bdfcbaa4b.simg bash -c " set -euo pipefail;  kaiju -x -v -t /data/nodes.dmp -f /data/kaiju_db_nr_euk.fmi -i /data/SRR606249_1.trim2.fq.gz -j /data/SRR606249_2.trim2.fq.gz -o /data/SRR606249.kaiju_output.trim2.out -z 4 "' returned non-zero exit status 139.
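
For readers unfamiliar with the exit status in the log above: shells report a process killed by a signal as 128 plus the signal number, so 139 corresponds to signal 11 (SIGSEGV), which matches the Segmentation Fault seen interactively. A minimal sketch of that decoding follows; the function name `describe_exit_status` is hypothetical and not part of the dahak workflows.

```shell
# Decode a shell exit status. Values above 128 mean the process was
# terminated by signal number (status - 128).
# describe_exit_status is an illustrative helper, not part of dahak.
describe_exit_status() {
  status="$1"
  if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    # kill -l <number> prints the name of that signal
    echo "terminated by signal $sig ($(kill -l "$sig"))"
  elif [ "$status" -eq 0 ]; then
    echo "success"
  else
    echo "exited with error code $status"
  fi
}

describe_exit_status 139
```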
charlesreid1 commented 6 years ago

Additional useful info

This step was working fine 1 month ago. Also, 1 month ago the URL to the kaiju database file led directly to the kaiju database file (kaiju_index_nr_euk.tgz). That link was http://kaiju.binf.ku.dk/database/kaiju_index_nr_euk.tgz

However, since that time, the link above now returns an HTTP 301 (redirect) for the database file. The kaiju database file definitely moved since the script was last working; is it possible this is the cause of the script not working (malformed or different data in the .tgz file)? Or is this a gold standard and not something we would expect to change?
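
One thing worth ruling out with a moved URL: `curl -O` does not follow redirects unless `-L` is given, so a download that hits a 301 can save the redirect response instead of the archive. Whether that happened here is an assumption; the sketch below only shows a download wrapper that follows redirects and fails loudly on HTTP errors. The function name `fetch` is hypothetical.

```shell
# Download a URL to a local path, following redirects and failing on HTTP errors.
# Usage: fetch <url> <output-path>
# fetch is an illustrative helper, not part of the dahak workflows.
fetch() {
  url="$1"
  out="$2"
  # -f : fail on HTTP errors (>= 400) instead of saving the error page
  # -L : follow 3xx redirects such as the 301 described above
  # -sS: hide the progress bar but still print errors
  curl -fsSL -o "$out" "$url"
}

# Example (not run here):
# fetch http://kaiju.binf.ku.dk/database/kaiju_index_nr_euk.tgz kaiju_index_nr_euk.tgz
```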

I do not have a copy of the database, or a checksum of it, from a month ago to check whether it has changed (note: we should include MD5 sums of data in the walkthroughs, documentation, and example workflows).
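
A minimal sketch of what such an MD5 check could look like, assuming GNU `md5sum` is available (as on the Ubuntu image used here). The function name `verify_md5` and the digest in the example are placeholders; no real checksum for the database is recorded in this thread.

```shell
# Compare a file's MD5 digest against an expected value.
# Usage: verify_md5 <file> <expected-md5>
# verify_md5 is an illustrative helper, not part of the dahak workflows.
verify_md5() {
  file="$1"
  expected="$2"
  # md5sum prints "<digest>  <filename>"; keep only the digest
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ "$actual" != "$expected" ]; then
    echo "MD5 mismatch for $file: expected $expected, got $actual" >&2
    return 1
  fi
  echo "MD5 OK: $file"
}

# Example (the digest is a placeholder, NOT the real checksum of the database):
# verify_md5 kaiju_index_nr_euk.tgz 0123456789abcdef0123456789abcdef
```

Recording the digest next to each download URL would make it possible to tell a changed upstream file from a truncated download.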

I have downloaded the current version of kaiju_index_nr_euk.tgz and put it in an AWS S3 bucket, for the reasons mentioned above (faster/more polite) and also to make sure we have a version of it. That link is https://s3.amazonaws.com/dahak-project-ucdavis/kaiju/kaiju_index_nr_euk.tgz

What else I tried

I also tried running the commands interactively from the singularity and docker containers (i.e., getting an interactive shell and copying and pasting the command to make sure the files existed and I didn't have any syntax wrong). These all resulted in Segmentation Faults.

I also tried running with other data files (using different files for -i and -j), for example:

    -i /data/SRR606249_subset10_1.trim2.fq.gz \
    -j /data/SRR606249_subset10_2.trim2.fq.gz \

with the same outcome - Segmentation Fault.

What I did not try

Did not try a different version of kaiju (this worked before with this version, so in principle it should still work).

charlesreid1 commented 6 years ago

~Apparently this is a known issue - kaiju databases are occasionally updated to non-working states.~

Plan:

brooksph commented 6 years ago

Related to https://github.com/dahak-metagenomics/dahak/issues/2 and https://github.com/dahak-metagenomics/dahak/issues/36

brooksph commented 6 years ago

The database location changes frequently and downloads are sometimes incomplete, and in neither case does an error message point to the database as the problem. It may therefore be useful to 1) include an error message in the workflows telling the user to redownload the database if the next step fails, or 2) force a redownload if the next step fails. For the duration of the project it may be worth hosting the database on Amazon (@ctb).
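
As a rough sketch of option 1, the archive could be checked before extraction so that a truncated download produces a clear message instead of a later segfault. This assumes the failure mode is an unreadable or incomplete .tgz; the function name `check_tarball` is hypothetical and not part of the workflows.

```shell
# Return 0 if the gzip-compressed tarball can be read end to end, 1 otherwise.
# Usage: check_tarball <archive.tgz>
# check_tarball is an illustrative helper, not part of the dahak workflows.
check_tarball() {
  archive="$1"
  if [ ! -s "$archive" ]; then
    echo "ERROR: $archive is missing or empty; download the database again." >&2
    return 1
  fi
  # tar -tzf lists the members without extracting; a truncated download fails here
  if ! tar -tzf "$archive" > /dev/null 2>&1; then
    echo "ERROR: $archive looks truncated or corrupt; download the database again." >&2
    return 1
  fi
  return 0
}

# Example for option 2, forcing a redownload on failure (not run here):
# check_tarball kaiju_index_nr_euk.tgz || \
#   curl -fL -O https://s3.amazonaws.com/dahak-project-ucdavis/kaiju/kaiju_index_nr_euk.tgz
```

Listing the archive reads the whole file, so for a database of this size the check takes a while; it only catches unreadable archives, not a complete archive with different contents.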

ctb commented 6 years ago

On Tue, Jun 12, 2018 at 08:18:51PM +0000, Chaz Reid wrote:

> Apparently this is a known issue - kaiju databases are occasionally updated to non-working states.

wat

charlesreid1 commented 6 years ago

Revised:

kaiju databases are occasionally downloaded in non-working states

charlesreid1 commented 6 years ago

Update: