GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks source link

Too many sequences in header of input sam file #19

Closed vladimirkovacevic closed 8 years ago

vladimirkovacevic commented 8 years ago

When the number of sequences inside SAM/BAM file is larger than 32768 samblaster will exit on the following location in the source(samblaster.cpp):

if (count >= (1 << 15))
        {
            fatalError("samblaster: Too many sequences in header of input sam file.  Exiting.\n");
        }

How can this error be caught when it is used piping with bwa mem and sambamba view/sort? In this scenario execution finishes fine(without an error), while the output BAM file is corrupted (contains 1.5MB, while input fastq files are ~50GB).

vyx-greg-faust commented 8 years ago

From what environment are you calling samblaster with the pipes? From bash? From within Python? It seems that some process in the piped list is ignoring SIGPIPE. This is a known problem in older versions of Python. Also, did the whole piped command run for long? It should terminate almost instantly, and the error message from samblaster should be seen on stderr.

Can you send the entire piped command?

Thanks, Greg

vladimirkovacevic commented 8 years ago

Hi Greg! No python involved, it is called from bash command line. It ended in ~2 minutes. True, there is a message on stderr, but the execution finished successfully. Here is the command line: /opt/bwa-0.7.13/bwa mem -M -R '@RG ID:1' -t 30 reference.fa 001.fastq.gz 002.fastq.gz | _/opt/samblaster/samblaster -r -i /dev/stdin -o /dev/stdout _| /opt/sambamba_v0.6.0 view -t 8 -f bam -l 0 -S /dev/stdin | /opt/sambamba_v0.6.0 sort -t 8 -m 10GiB --tmpdir ./ -o out.bam -l 5 /dev/stdin

Anyway, samblaster calls exit(1) in fatalError() which is perfectly fine, so it is definitely, as you said, an environment issue, just wanted to ask you if you maybe know what exactly.

Thanks in advance, Vladimir

P. S. Great tool, thanks for creating and supporting it!

GregoryFaust commented 8 years ago

So, what were you hoping for. Did the entire pipe not return an error, such as non-zero error code? Is that what you need? I used to do an abort() in fatalError, but his does a core dump which is pretty unfriendly. In addition, if you are calling this pipe from within a bash shell script, you can use a -e on the command line which should cause the entire script to exit.

Thanks, Greg

On Wed, Mar 30, 2016 at 5:32 AM, vladimirkovacevic <notifications@github.com

wrote:

Hi Greg! No python involved, it is called from bash command line. It ended in ~2 minutes. True, there is a message on stderr, but the execution finished successfully. Here is the command line: /opt/bwa-0.7.13/bwa mem -M -R '@RG https://github.com/RG ID:1' -t 30 reference.fa 001.fastq.gz 002.fastq.gz | _/opt/samblaster/samblaster -r -i /dev/stdin -o /dev/stdout _| /opt/sambamba_v0.6.0 view -t 8 -f bam -l 0 -S /dev/stdin | /opt/sambamba_v0.6.0 sort -t 8 -m 10GiB --tmpdir ./ -o out.bam -l 5 /dev/stdin

Thanks in advance, Vladimir

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/GregoryFaust/samblaster/issues/19#issuecomment-203345299

vladimirkovacevic commented 8 years ago

Yes, I need non-zero error code from the entire pipe or last tool in the pipe (sambamba), which did not receive any data from its stdin, so it should throw an error. As I said, this issue is not related to samblaster, it is for sambamba. Thanks for the answers and clarification.

GregoryFaust commented 8 years ago

Actually, sambamba will receive all of the sam header up until the number of sequences goes over the samblaster limit. So, it will have what appears to be a valid sam file as input, one with lots of header information but no reads. BWA, on the other hand, should notice that its writes are failing and exit with a non-zero code. However, the following man page seems to indicate that the overall error code for a piped command is the return code of the last command, or sambamba in this case, which as indicated above isn't really in an error condition. http://www.tldp.org/LDP/abs/html/exit-status.html

vladimirkovacevic commented 8 years ago

Greg, just one notice for you. Currently there is the FDA Challenge currently open: https://precision.fda.gov/challenges/consistency They are searching for the best pipeline to process offered data (fastq) and at the end of the contest they will publish it. Both samples they offer for processing contain more sequences in the header than samblaster support. This will probably eliminate samblaster from being present in this pipeline, which is a pity because it is a great tool! Just wanted to let you know so you maybe might consider in future to add support for more sequences, if possible.

Thanks, Vladimir

GregoryFaust commented 8 years ago

The challenge states to use GRCh37. What version of GRCh37 are you using that has more than 32K contigs?

On Mon, Apr 4, 2016 at 5:15 AM, vladimirkovacevic notifications@github.com wrote:

Greg, just one notice for you. Currently there is the FDA Challenge currently open: https://precision.fda.gov/challenges/consistency They are searching for the best pipeline to process offered data (fastq) and at the end of the contest they will publish it. Both samples they offer for processing contain more sequences in the header than samblaster support. This will probably eliminate samblaster from being present in this pipeline, which is a pity because it is a great tool! Just wanted to let you know so you maybe might consider in future to add support for more sequences, if possible.

Thanks, Vladimir

— You are receiving this because you modified the open/close state. Reply to this email directly or view it on GitHub https://github.com/GregoryFaust/samblaster/issues/19#issuecomment-205206682