biod / sambamba

Tools for working with SAM/BAM data
http://thebird.nl/blog/D_Dragon.html
GNU General Public License v2.0
557 stars 104 forks source link

sambamba-markdup: not enough data in stream via /dev/stdin #421

Open heuermh opened 4 years ago

heuermh commented 4 years ago

While I realize our use case is somewhat complicated (running sambamba via docker executed in an Apache Spark cluster), I am hoping the underlying issue may be simple.

Is there something about BAM format streamed to sambamba over /dev/stdin that would cause a not enough data in stream error?

Alternatively, might it be possible to read from stdin and write to stdout directly?

https://github.com/bigdatagenomics/cannoli/pull/203

If not clear from the log below, we're using sambamba version 0.7.1 via Docker image tag quay.io/biocontainers/sambamba:0.7.1--h148d290_0. Similar issues are seen when using version 0.6.8 installed locally via homebrew (0.7.1 does not appear to be available via brewsci/bio/sambamba yet).

$ ./bin/cannoli-submit \
  sambambaMarkdup \
  -use_docker \
  NA12878.alignedHg38.duplicateMarked.baseRealigned.bam \
  NA12878.alignedHg38.duplicateMarked.baseRealigned.markdup.alignments.adam
...
19/12/17 10:31:18 ERROR Utils: Aborting task
java.lang.RuntimeException: Piped command List(docker, run, -i, --rm,
quay.io/biocontainers/sambamba:0.7.1--h148d290_0, sambamba, markdup,
/dev/stdin, /dev/stdout) exited with error code 1.
...
sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)
...
  sorted 1276 end pairs
     and 39 single ends (among them 26 unmatched pairs)
  collecting indices of duplicate reads...   done in 9 ms
  found 2 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream
  sorted 2432 end pairs
     and 86 single ends (among them 45 unmatched pairs)
  collecting indices of duplicate reads...   done in 16 ms
  found 4 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream
  sorted 1402 end pairs
     and 59 single ends (among them 47 unmatched pairs)
  collecting indices of duplicate reads...   done in 4 ms
  found 2 duplicates
collected list of positions in 0 min 1 sec
sambamba-markdup: not enough data in stream
pjotrp commented 4 years ago

See #93

heuermh commented 4 years ago

I tried reordering the command as in the issue you referenced

List(docker, run, -i, --rm, quay.io/biocontainers/sambamba:0.7.1--h148d290_0,
sambamba, markdup, -o, /dev/stdout, /dev/stdin)
...
sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.13.0 / DMD v2.083.1 / LLVM7.0.1 / bootstrap LDC - the LLVM D compiler (0.17.6)

sambamba-markdup: Unrecognized option -o

Regardless, it is our application that is streaming BAM format to /dev/stdin though, not sambamba. Are you looking for an end-of-block or end-of-file marker, or is this not enough data in stream error due to an incomplete record?

pjotrp commented 4 years ago

Sambamba expects a proper BAM stream. I can't validate this way.

heuermh commented 4 years ago

Sambamba expects a proper BAM stream. I can't validate this way.

I'm sorry, I'm not sure what you mean here, as far as I know we are sending a properly formatted BAM stream. It works well with several other bioinformatics tools.

I am trying to learn whether sambamba is doing validation differently than other tools, or if there might be an issue with using /dev/stdin instead of using stdin from a shell pipe.

pjotrp commented 4 years ago

It may help if you send an example we can reproduce.

pjotrp commented 3 years ago

No activity

heuermh commented 3 years ago

Hello @pjotrp, sorry for not responding earlier. This is still an issue even with the latest 0.7.1 container.

To reproduce, running sambamba locally

$ git clone https://github.com/heuermh/cannoli
$ checkout sambamba-markdup
$ mvn install
...
$ ./bin/cannoli-submit \
  sambambaMarkdup \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.markdup.alignments.adam
...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.1 / DMD v2.090.1 / LLVM9.0.1 / bootstrap LDC - the LLVM D compiler (1.20.1)

finding positions of the duplicate reads in the file...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.1 / DMD v2.090.1 / LLVM9.0.1 / bootstrap LDC - the LLVM D compiler (1.20.1)

finding positions of the duplicate reads in the file...
  sorted 109868 end pairs
     and 2807 single ends (among them 261 unmatched pairs)
  collecting indices of duplicate reads...   done in 89 ms
  found 28940 duplicates
collected list of positions in 0 min 7 sec
sambamba-markdup: not enough data in stream

With sambamba via docker

$ docker pull quay.io/biocontainers/sambamba:0.7.1--h984e79f_3
...
$ ./bin/cannoli-submit \
  sambambaMarkdup \
  -use_docker \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam \
  CEUTrio.HiSeq.WGS.b37.NA12878.20.21.markdup.docker.alignments.adam
...

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)

sambamba 0.7.1
 by Artem Tarasov and Pjotr Prins (C) 2012-2019
    LDC 1.20.0 / DMD v2.090.1 / LLVM7.0.0 / bootstrap LDC - the LLVM D compiler (0.17.6)

finding positions of the duplicate reads in the file...
finding positions of the duplicate reads in the file...
  sorted 109868 end pairs
     and 2807 single ends (among them 261 unmatched pairs)
  collecting indices of duplicate reads...   done in 113 ms
  found 28940 duplicates
collected list of positions in 0 min 9 sec
sambamba-markdup: not enough data in stream

Prerequisites are OpenJDK, Apache Maven, and Docker, if running that way. Any BAM file of non-trivial size will do.

Either of these could be a big help:

Thank you in advance!

pjotrp commented 3 years ago

Not enough data in stream gets reported when a file ends/closes before the expected end. BAM has block sizes, so the stream stops before parsing is done and an error gets reported. Probably best to try building a simple test case outside a container and see if you can replicate the problem.

If you can not replicate we could build a sambamba with debugging information and you can do a back trace.

pjotrp commented 3 years ago

Happy to help a fellow coder.

pjotrp commented 3 years ago

I just confirmed the issue with

cat test/issue_204.bam | ./bin/sambamba-0.8.0-linux-amd64-static markdup /dev/stdin test.bam

sambamba 0.8.0
 by Artem Tarasov and Pjotr Prins (C) 2012-2020
    LDC 1.10.0 / DMD v2.080.1 / LLVM6.0.1 / bootstrap LDC - the LLVM D compiler (0.17.4)

finding positions of the duplicate reads in the file...
  sorted 277 end pairs
     and 5 single ends (among them 5 unmatched pairs)
  collecting indices of duplicate reads...   done in 0 ms
  found 62 duplicates
collected list of positions in 0 min 0 sec
sambamba-markdup: not enough data in stream

markdup using a file runs normally. Other sambamba commands do not show this issue.

BTW on your comment support reading from stdin directly instead of via /dev/stdin: they are the one and same thing.

heuermh commented 3 years ago

Good catch! Sorry I missed trying that way (cat | sambamba markdup /dev/stdin) before; I only noticed that it worked ok on files.

pjotrp commented 3 years ago

The problem is that markdup, to avoid having the whole BAM file in RAM, reads the input file twice - once for getting the positions, then another round for writing the output file. Reading twice from stdin obviously won't work, so you get this error: "not enough data in stream".

In a shell this can work

sambamba-0.8.0 markdup /dev/stdin \
  ex1_header.dedup.bam < test/issue_204.sorted.bam

but that may imply an overhead because you are dealing with a file rather than an interprocess communication.

pjotrp commented 3 years ago

Looking at the code I think this can be made a single pass for a sorted BAM. That would make it faster. A future version could also hold an unsorted BAM file in RAM because 128GB machines are getting increasingly common.

pwaltman commented 3 years ago

Are there any plans to fix this behavior? Technically, I'd argue that it's still a bug, as sambamba accepts input that has been piped to it from another application, but fails to process the input correctly (or as expected).

Basically, the documentation implies that sambamba-markdup can be used to pipe commands together, but will throw this error if one tries to combine utilities into a single command, e.g.

bwa mem -R $RG_string $REFERENCE -t 8 fastq_R1.fq fastq_R2.fq \
| samtools view -b -S -u - \
   sambamba markdup -t 8 /dev/stdin $OUTFILE

Until this is fixed/updated, I don't understand the rationale for letting markdup accept input from stdin, since this would seem like the primary scenario in which this functionality would be used.

pjotrp commented 3 years ago

You are right, we should not allow piping. Feel free to submit a PR.