Open heuermh opened 4 years ago
See #93
I tried reordering the command as in the issue you referenced
List(docker, run, -i, --rm, quay.io/biocontainers/sambamba:0.7.1--h148d290_0,
sambamba, markdup, -o, /dev/stdout, /dev/stdin)
...
sambamba 0.7.1
by Artem Tarasov and Pjotr Prins (C) 2012-2019
LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)
sambamba-markdup: Unrecognized option -o
Regardless, it is our application that is streaming BAM format to /dev/stdin though, not sambamba. Are you looking for an end-of-block or end-of-file marker, or is this "not enough data in stream" error due to an incomplete record?
Sambamba expects a proper BAM stream. I can't validate this way.
I'm sorry, I'm not sure what you mean here, as far as I know we are sending a properly formatted BAM stream. It works well with several other bioinformatics tools.
I am trying to learn whether sambamba is doing validation differently than other tools, or if there might be an issue with using /dev/stdin instead of using stdin from a shell pipe.
It may help if you send an example we can reproduce.
Hello @pjotrp, sorry for not responding earlier. This is still an issue even with the latest 0.7.1 container.
To reproduce, running sambamba locally:
$ git clone https://github.com/heuermh/cannoli
$ cd cannoli
$ git checkout sambamba-markdup
$ mvn install
...
$ ./bin/cannoli-submit \
sambambaMarkdup \
CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
CEUTrio.HiSeq.WGS.b37.NA12878.20.21.markdup.alignments.adam
...
sambamba 0.7.1
by Artem Tarasov and Pjotr Prins (C) 2012-2019
LDC 1.20.1 / DMD v2.090.1 / LLVM9.0.1 / bootstrap LDC - the LLVM D compiler (1.20.1)
finding positions of the duplicate reads in the file...
sambamba 0.7.1
by Artem Tarasov and Pjotr Prins (C) 2012-2019
LDC 1.20.1 / DMD v2.090.1 / LLVM9.0.1 / bootstrap LDC - the LLVM D compiler (1.20.1)
finding positions of the duplicate reads in the file...
sorted 109868 end pairs
and 2807 single ends (among them 261 unmatched pairs)
collecting indices of duplicate reads... done in 89 ms
found 28940 duplicates
collected list of positions in 0 min 7 sec
sambamba-markdup: not enough data in stream
With sambamba via docker:
$ docker pull quay.io/biocontainers/sambamba:0.7.1--h984e79f_3
...
$ ./bin/cannoli-submit \
sambambaMarkdup \
-use_docker \
CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
CEUTrio.HiSeq.WGS.b37.NA12878.20.21.markdup.docker.alignments.adam
...
sambamba 0.7.1
by Artem Tarasov and Pjotr Prins (C) 2012-2019
LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)
sambamba 0.7.1
by Artem Tarasov and Pjotr Prins (C) 2012-2019
LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)
finding positions of the duplicate reads in the file...
finding positions of the duplicate reads in the file...
sorted 109868 end pairs
and 2807 single ends (among them 261 unmatched pairs)
collecting indices of duplicate reads... done in 113 ms
found 28940 duplicates
collected list of positions in 0 min 9 sec
sambamba-markdup: not enough data in stream
Prerequisites are OpenJDK, Apache Maven, and Docker, if running that way. Any BAM file of non-trivial size will do.
Either of these could be a big help:
- support reading from stdin directly instead of via /dev/stdin
- a --verbose option, or another way to provide more details about why validation is failing

Thank you in advance!
Not enough data in stream gets reported when a file ends/closes before the expected end. BAM has block sizes, so the stream stops before parsing is done and an error gets reported. Probably best to try building a simple test case outside a container and see if you can replicate the problem.
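To make the "BAM has block sizes" point concrete, here is a minimal Python sketch that walks the BGZF blocks of a byte string and reports when the data stops part-way through a block. It assumes the BGZF framing described in the SAM/BAM specification (a gzip header with a `BC` extra subfield holding the total block size minus one). It is not sambamba's own check, which is not shown in this thread, and the function name `bgzf_block_sizes` is made up for illustration.

```python
import struct

# The 28-byte empty BGZF block that the SAM/BAM specification defines as the
# end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00 1b 00 03 00 "
    "00 00 00 00 00 00 00 00"
)


def bgzf_block_sizes(data):
    """Return the total size of each BGZF block found in `data`.

    Raises ValueError if the data ends before a block is complete, which is
    the situation the "not enough data in stream" message describes.
    (Hypothetical helper, not part of sambamba.)
    """
    sizes = []
    offset = 0
    while offset < len(data):
        header = data[offset:offset + 12]
        if len(header) < 12:
            raise ValueError("not enough data: truncated block header")
        if header[:4] != b"\x1f\x8b\x08\x04":
            raise ValueError("not a BGZF block")
        # XLEN: length of the gzip extra field, little-endian uint16.
        xlen = struct.unpack_from("<H", header, 10)[0]
        extra = data[offset + 12:offset + 12 + xlen]
        if len(extra) < xlen:
            raise ValueError("not enough data: truncated extra field")
        # Scan the extra subfields for the 'B','C' subfield carrying BSIZE.
        bsize = None
        pos = 0
        while pos + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
            if si1 == 66 and si2 == 67 and slen == 2:
                bsize = struct.unpack_from("<H", extra, pos + 4)[0]
                break
            pos += 4 + slen
        if bsize is None:
            raise ValueError("missing BC subfield")
        total = bsize + 1  # BSIZE stores the total block size minus one
        if offset + total > len(data):
            raise ValueError("not enough data: truncated block")
        sizes.append(total)
        offset += total
    return sizes
```

Running this over a capture of the bytes the producer actually wrote would show whether the stream itself is cut short or whether the consumer is the one running out of data.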
If you cannot replicate it, we could build a sambamba with debugging information and you could get a backtrace.
Happy to help a fellow coder.
I just confirmed the issue with
cat test/issue_204.bam | ./bin/sambamba-0.8.0-linux-amd64-static markdup /dev/stdin test.bam
sambamba 0.8.0
by Artem Tarasov and Pjotr Prins (C) 2012-2020
LDC 1.10.0 / DMD v2.080.1 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.4)
finding positions of the duplicate reads in the file...
sorted 277 end pairs
and 5 single ends (among them 5 unmatched pairs)
collecting indices of duplicate reads... done in 0 ms
found 62 duplicates
collected list of positions in 0 min 0 sec
sambamba-markdup: not enough data in stream
markdup using a file runs normally. Other sambamba commands do not show this issue.
BTW on your comment "support reading from stdin directly instead of via /dev/stdin": they are one and the same thing.
Good catch! Sorry I missed trying that way (cat | sambamba markdup /dev/stdin) before; I only noticed that it worked ok on files.
The problem is that markdup, to avoid having the whole BAM file in RAM, reads the input file twice - once for getting the positions, then another round for writing the output file. Reading twice from stdin obviously won't work, so you get this error: "not enough data in stream".
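The two-pass behaviour described above can be illustrated with a small Python sketch that is independent of sambamba. It reads a stream to the end, tries to rewind, and reads again, once over a pipe and once over a regular file. The helper names `read_twice` and `demo` are made up for this example.

```python
import os
import tempfile


def read_twice(stream):
    """Read `stream` to the end, try to rewind it, then read it again."""
    first = stream.read()
    try:
        stream.seek(0)
    except OSError:
        # Pipes are not seekable, so the second pass starts at end-of-stream.
        pass
    second = stream.read()
    return first, second


def demo():
    payload = b"pretend this is BAM data"

    # A pipe: the data is consumed by the first pass and cannot be replayed.
    read_fd, write_fd = os.pipe()
    os.write(write_fd, payload)
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe_stream:
        from_pipe = read_twice(pipe_stream)

    # A regular file: rewinding works, so both passes see the full data.
    with tempfile.TemporaryFile() as file_stream:
        file_stream.write(payload)
        file_stream.seek(0)
        from_file = read_twice(file_stream)

    return from_pipe, from_file
```

Over the pipe the second pass comes back empty, which matches a second reader running out of data. Over the regular file both passes return the full payload.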
In a shell this can work
sambamba-0.8.0 markdup /dev/stdin \
ex1_header.dedup.bam < test/issue_204.sorted.bam
but that may imply an overhead because you are dealing with a file rather than an interprocess communication.
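For callers that only have a stream, one possible workaround (an assumption on my part, not something sambamba provides) is to spool the stream to a temporary file first and hand that path to markdup. The Python sketch below does this. The names `spool_and_run` and `markdup_command` are hypothetical, and the `sambamba markdup -t N input output` form is taken from the commands quoted in this thread. It assumes a POSIX system, where a named temporary file can be reopened by another process while it is still open.

```python
import shutil
import subprocess
import tempfile


def markdup_command(input_path, output_path, threads=8):
    """Build the markdup invocation in the form used elsewhere in this thread."""
    return ["sambamba", "markdup", "-t", str(threads), input_path, output_path]


def spool_and_run(stream, build_command):
    """Copy `stream` to a temporary file, then run build_command(path).

    The temporary file gives the tool a seekable input it can read twice.
    It is deleted when the `with` block exits.
    """
    with tempfile.NamedTemporaryFile(suffix=".bam") as spool:
        shutil.copyfileobj(stream, spool)
        spool.flush()  # make sure all bytes are on disk before the tool opens it
        return subprocess.run(build_command(spool.name), check=True)
```

Usage would look like `spool_and_run(sys.stdin.buffer, lambda path: markdup_command(path, "out.bam"))`. The cost is the same as the redirect above: the whole input lands on disk once before duplicate marking starts.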
Looking at the code I think this can be made a single pass for a sorted BAM. That would make it faster. A future version could also hold an unsorted BAM file in RAM because 128GB machines are getting increasingly common.
Are there any plans to fix this behavior? Technically, I'd argue that it's still a bug, as sambamba accepts input that has been piped to it from another application, but fails to process the input correctly (or as expected).
Basically, the documentation implies that sambamba-markdup can be used to pipe commands together, but will throw this error if one tries to combine utilities into a single command, e.g.
bwa mem -R $RG_string $REFERENCE -t 8 fastq_R1.fq fastq_R2.fq \
| samtools view -b -S -u - \
| sambamba markdup -t 8 /dev/stdin $OUTFILE
Until this is fixed/updated, I don't understand the rationale for letting markdup accept input from stdin, since this would seem like the primary scenario in which this functionality would be used.
You are right, we should not allow piping. Feel free to submit a PR.
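As a rough idea of what rejecting piped input could look like, the Python sketch below checks whether a file descriptor refers to a regular file, which can be read twice, or to something else such as a pipe. sambamba is written in D, so this is only an illustration of the check and not a patch. `is_rereadable` is a hypothetical name.

```python
import os
import stat


def is_rereadable(fd):
    """Return True if `fd` refers to a regular file that can be read twice."""
    mode = os.fstat(fd).st_mode
    return stat.S_ISREG(mode)
```

A tool could call this on file descriptor 0 when its input is /dev/stdin and exit with a clear message, rather than fail later with "not enough data in stream".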
While I realize our use case is somewhat complicated (running sambamba via docker executed in an Apache Spark cluster), I am hoping the underlying issue may be simple.
Is there something about BAM format streamed to sambamba over /dev/stdin that would cause a "not enough data in stream" error? Alternatively, might it be possible to read from stdin and write to stdout directly?
https://github.com/bigdatagenomics/cannoli/pull/203
If not clear from the log below, we're using sambamba version 0.7.1 via Docker image tag quay.io/biocontainers/sambamba:0.7.1--h148d290_0. Similar issues are seen when using version 0.6.8 installed locally via homebrew (0.7.1 does not appear to be available via brewsci/bio/sambamba yet).