GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License

Can't find first and/or second of pair in sam block of length 1 for id:... #37

Closed bibbers93 closed 4 years ago

bibbers93 commented 6 years ago

Hi, I've recently downloaded SAMBLASTER for retrieving softclipped, split, and unmapped reads from whole genome sequencing data. At the moment I'm practicing on chromosome 17 of NA12878 (available from GIAB).

I've extracted chr17, generated a BAM file, and indexed it, all using samtools. I then ran:

samtools view -h input.bam | samblaster -a -e -d dup.sam -s split.sam -u unmap.sam -o /dev/null 

In this first instance it runs, but I then get the following messages on my screen:

samblaster: Inputting from stdin
samblaster: Opening /dev/null for write.
samblaster: Opening 17dup.sam for write.
samblaster: Opening 17soft.sam for write.
samblaster: Opening 17unmap.sam for write.
samblaster: Loaded 86 header sequence entries.
samblaster: Can't find first and/or second of pair in sam block of length 1 for id: H06HDADXX130110:1:1101:3355:34950
samblaster:    At location: 17:798227
samblaster:    Are you sure the input is sorted by read ids?

In checking, I don't think my original input.bam was sorted by QNAME, so I've done the following

samtools sort -n input.bam #fyi, I can't now index this, is that a problem?
samtools view -h input.bam | samblaster -a -e -d dup.sam -s split.sam -u unmap.sam -o /dev/null 

On re-running this, I get the same error messages as before, BUT there are now output .sam files in my current directory. Has samblaster quit with only part of the data in these -d, -u, and -s files, or is that the final output? And how can I overcome these errors about my paired reads and sorting by read ID?

Thanks!! :)

GregoryFaust commented 6 years ago

Using default parameters, samblaster treats unmated reads as a fatal error and stops processing the input at that location in the input SAM file. This is because this error is usually caused by a mis-sorted input file as in your first run. Once you properly name-sorted the input file, the sort issue is resolved. However, because you have selected reads mapped to a single chromosome, your input file will undoubtedly still have unmated reads due to read pairs in which the two reads map to different chromosomes. That is why you ended up with a partial output file, as the input was processed until samblaster reached the first such read.
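To get a rough idea of how many read ids would trip this check before running samblaster, something like the following awk sketch may help. It is not part of samblaster and does not reproduce its logic; the function name `count_unmated` is hypothetical. It assumes SAM text on stdin and simply counts read ids that do not have exactly two primary records (secondary, flag 0x100, and supplementary, flag 0x800, records are skipped). Because it tallies ids in memory, it does not need name-sorted input.

```shell
# Hypothetical helper, not part of samblaster: count read ids in a SAM
# stream that do not have exactly two primary alignment records.
count_unmated() {
    awk -F'\t' '
        /^@/ { next }                      # skip SAM header lines
        {
            flag = $2
            # Skip secondary (0x100) and supplementary (0x800) records.
            if (int(flag / 256) % 2 == 1 || int(flag / 2048) % 2 == 1) next
            n[$1]++                        # tally primary records per read id
        }
        END {
            c = 0
            for (q in n) if (n[q] != 2) c++
            print c
        }'
}

# Example use (file name is illustrative):
#   samtools view -h input.bam | count_unmated
```

A non-zero count on a single-chromosome extract would be consistent with the explanation above: the mates of those reads map to chromosomes that were filtered out.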

When you are REALLY sure that your input is properly sorted, you can use the --ignoreUnmated option to allow samblaster to continue to process the input past such unmated reads. If you use this option, samblaster will output to stderr all of the read-ids that were unmated, and give a count of unmated reads with the rest of the statistics at the end of the run. Therefore, you might want to consider piping or redirecting stderr to a file to capture this output and avoid a huge volume of output to the screen.
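Put together, the workflow might look like the sketch below. The function name, the default file names `namesorted.bam` and `samblaster.log`, and the use of `samtools sort -n -o` (available in recent samtools releases) are assumptions for illustration; the samblaster options are the ones from the original command plus --ignoreUnmated. The name-sorted output is written to a new file, and that file, not the original BAM, is what gets piped to samblaster. A name-sorted BAM cannot be indexed by samtools, which should not matter here because samblaster only reads the stream.

```shell
# Hypothetical wrapper; function name and default file names are illustrative.
run_samblaster_ignore_unmated() {
    in_bam=$1                          # original (coordinate-sorted or unsorted) BAM
    sorted_bam=${2:-namesorted.bam}    # where the name-sorted copy is written
    log=${3:-samblaster.log}           # captures samblaster's stderr

    # Name-sort into a NEW file; abort if the sort fails.
    samtools sort -n -o "$sorted_bam" "$in_bam" || return 1

    # Stream the name-sorted reads (with header) into samblaster and
    # send its messages and statistics to the log instead of the screen.
    samtools view -h "$sorted_bam" |
        samblaster --ignoreUnmated -a -e \
            -d dup.sam -s split.sam -u unmap.sam -o /dev/null 2> "$log"
}

# Example use:
#   run_samblaster_ignore_unmated input.bam
```

The count of unmated reads should then appear with the other end-of-run statistics in the log file.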

GregoryFaust commented 4 years ago

N.B. samblaster no longer outputs the ids of unmated reads to stderr when using the --ignoreUnmated option, but still outputs the count of the number of unmated primary reads found.

GregoryFaust commented 4 years ago

Release 0.1.25 adds better explanations and usage scenarios for the --ignoreUnmated option in both the README.md and the program help.