TheJacksonLaboratory / cs-nf-pipelines

The Jackson Laboratory Computational Sciences Nextflow based analysis pipelines
MIT License

Add pipefail to bowtie process #5

Closed · alanhoyle closed 2 months ago

alanhoyle commented 2 months ago

If a corrupted FASTQ is input, the process creates corrupted output but does not record the error or cause Nextflow to realize the process has failed.

This is because the error comes from the initial `zcat`; since the command runs through a series of Unix pipes, the non-zero exit code is never caught.

This PR also changes the formatting of the Nextflow script and the `.command.sh` file so that each process in the pipe is on a separate line.
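
For illustration, a minimal sketch of the intended script shape, assuming pipefail is enabled via the `set` builtin (this reuses the command from the log further down, with the index path abbreviated; it is not the exact PR diff):

```bash
#!/bin/bash -ue
set -o pipefail  # fail the pipeline if any stage exits non-zero, not just the last one

zcat SAMPLE1_S1_R1_001.fastq.gz \
    | bowtie -p 8 -q -a --best --strata --sam -v 3 -x bowtie.transcripts - 2> S1.bowtie_R1.log \
    | samtools view -bS - > S1_mapped_R1.bam
```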

MikeWLloyd commented 2 months ago

Thanks for the contribution. We will take a look!

MikeWLloyd commented 2 months ago

Can you provide a corrupted FASTQ example to run through the change?

alanhoyle commented 2 months ago

Just take a FASTQ and truncate it randomly. That's what happened here...

MikeWLloyd commented 2 months ago

I've taken a FASTQ, unzipped it, truncated it randomly, and then re-gzipped it. However, the behavior I see with Bowtie is that it aligns those reads that are available. In our case, Bowtie is used by EMASE/GBRS, and when PE reads are used the reads are aligned independently. If a read is corrupt or truncated, Bowtie ignores it and the process completes without error. If I remove reads in one pair, I get an error in the downstream processes that recombine PE reads, but that is expected behavior. Is the issue you are catching related to the gzip file itself being corrupted or incomplete?

alanhoyle commented 2 months ago

@MikeWLloyd, the issue is that the compressed file is corrupt. Truncate the compressed fastq.gz with something like `truncate -s 10000000 copy_of_your.fastq.gz`.

Without `set -o pipefail`, Bash (and therefore Nextflow scripts) only reports the exit code of the last command in a series of pipes; failures earlier in the pipe are silently discarded.
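
A quick self-contained demonstration of the shell semantics (nothing here is specific to this pipeline):

```bash
#!/usr/bin/env bash
# By default, a pipeline's exit status is that of its LAST command only.
false | true
echo "without pipefail: $?"   # prints 0; the failure of `false` is invisible

# With pipefail, the pipeline returns the rightmost non-zero status instead.
set -o pipefail
false | true
echo "with pipefail: $?"      # prints 1
```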

Below is a mildly edited transcript showing the error we saw:


```
$ cat .command.sh
#!/bin/bash -ue
zcat SAMPLE1_S1_R1_001.fastq.gz | bowtie -p 8 -q -a --best --strata --sam -v 3 -x /path/to/user/EMASE_GBRS_ref_data/zenodo.org/records/8289936/files/bowtie/bowtie.transcripts - 2> S1.bowtie_R1.log | samtools view -bS - > S1_mapped_R1.bam
$ cat .command.log

gzip: SAMPLE1_S1_R1_001.fastq.gz: unexpected end of file
$ cat .exitcode
0$
```
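
For reference, a self-contained reproduction along these lines, where `your.fastq.gz` stands in for any gzipped FASTQ larger than 10 MB:

```bash
cp your.fastq.gz broken.fastq.gz
truncate -s 10000000 broken.fastq.gz   # cut the copy short mid-stream

# Default shell: gzip's complaint goes to stderr, but the pipe reports 0.
zcat broken.fastq.gz | wc -l
echo $?   # 0

# With pipefail, the same failure propagates as a non-zero exit code.
set -o pipefail
zcat broken.fastq.gz | wc -l
echo $?   # non-zero ("unexpected end of file" is an error in gzip)
```
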
alanhoyle commented 2 months ago

Note that when this occurs, the bowtie process finishes with an exit code of 0.

The next process after bowtie, `gbrs bam2emase -i SAMPLE1_mapped_R1.bam [...]`, also finishes with an exit code of 0, with a couple of errors in the `.command.log`:

```
[E::idx_find_and_load] Could not retrieve index file for '/path/to/nextflow/work/0f/9a118faba0e4397a5094d7e83cc40f/S1_mapped_R1.bam'
[E::idx_find_and_load] Could not retrieve index file for '/path/to/nextflow/work/0f/9a118faba0e4397a5094d7e83cc40f/S1_mapped_R1.bam'
```

It's not until the next step, which combines the R1/R2 files, that the process fails to generate an expected output file and throws an error that stops the workflow:

```
$ cat .command.sh
emase get-common-alignments -i SAMPLE1_mapped_R1.emase.h5 -i SAMPLE1_mapped_R2.emase.h5 -o SAMPLE1.merged.emase.h5
$ cat .command.log
gbrs [04:40:36] operands could not be broadcast together with shapes
                (202367808,) (204354899,)
```
MikeWLloyd commented 2 months ago

```
gbrs [04:40:36] operands could not be broadcast together with shapes
                (202367808,) (204354899,)
```

This is an expected error, as the reads are no longer paired.

However, you are correct that, given a corrupted gzip, the pipeline should not get this far. Using the `truncate` command (thanks for providing it; I wasn't aware of the tool), I was able to replicate the issue and confirm that your catch works. I wanted to reproduce it to determine whether we need to be conscious of it in other modules. We hadn't considered corrupted files in our testing, and we will do so moving forward.

I am working on a minor release and will incorporate this PR there. We run a dev env in a separate repo, so I am currently trying to align things.

Thanks again for your contribution.