GregoryFaust / samblaster

samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
MIT License
225 stars 30 forks source link

Duplicates were removed but log says 0 dups removed #55

Open sehawk opened 1 year ago

sehawk commented 1 year ago

Hi @GregoryFaust

We are executing samblaster 0.1.26 using below command

samblaster_0.1.26.simg samblaster --ignoreUnmated  -a -r -i input.sam -o output.sam 

samblaster log

samblaster: Version 0.1.26
samblaster: Opening input.sam for read.
samblaster: Opening output.sam for write.
samblaster: Loaded 25 header sequence entries.
samblaster: Found        62867 of    2402373 (2.617%) total read ids are marked paired yet are unmated.
samblaster: Please double check that input file is read-id (QNAME) grouped.
samblaster: Found          838 of    2402373 (0.035%) total read ids with no primary alignment.
samblaster: Please double check that input file is read-id (QNAME) grouped.
samblaster: Removed          0 of    2402373 (0.000%) total read ids as duplicates using 18128k memory in 1.911S CPU seconds and 3S wall time.

Flagstats of input sam

4744272 + 0 in total (QC-passed reads + QC-failed reads)
4423 + 0 secondary
0 + 0 supplementary
2178640 + 0 duplicates
4743484 + 0 mapped (99.98% : N/A)
4739849 + 0 paired in sequencing
2372230 + 0 read1
2367619 + 0 read2
4724503 + 0 properly paired (99.68% : N/A)
4737919 + 0 with itself and mate mapped
1142 + 0 singletons (0.02% : N/A)
3131 + 0 with mate mapped to a different chr
3064 + 0 with mate mapped to a different chr (mapQ>=5)

Flagstats of output sam post executing samblaster

2565632 + 0 in total (QC-passed reads + QC-failed reads)
2740 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
2564846 + 0 mapped (99.97% : N/A)
2562892 + 0 paired in sequencing
1283161 + 0 read1
1279731 + 0 read2
2551398 + 0 properly paired (99.55% : N/A)
2560969 + 0 with itself and mate mapped
1137 + 0 singletons (0.04% : N/A)
2585 + 0 with mate mapped to a different chr
2525 + 0 with mate mapped to a different chr (mapQ>=5)

Questions

  1. Why does samblaster show this warning ? samblaster: Please double check that input file is read-id (QNAME) grouped. . Our input sam is already QNAME sorted. We verified that by checking SAM header SO (sort order)
  2. Why does samblaster say that 0 duplicates removed while we can see that duplicates were indeed removed as shown by samtools flagstat
    samblaster: Removed          0 of    2402373 (0.000%) total read ids as duplicates using 18128k memory in 1.911S CPU seconds and 3S wall time.
GregoryFaust commented 1 year ago

Sorry, I should have responded to this a long time ago.

  1. samblaster outputs the message to double check the input file sort order whenever it finds a read flagged as part of a pair, but the mate does not appear next to it in the input SAM file. samblaster does not use the SO information in the header of a SAM file because that is not a reliable way to determine if the file is sorted in a particular order or not. All it means is that some tool within your pipeline sorted it that way, but not that it is necessarily still in that sort order. For example, not all the available SAM/BAM sorting tools actually update that field. In addition, some SAM file manipulations essentially require that one process the header and the rest of the file separately leading to the header not being fully up to date. Perhaps most importantly, when samblaster is used in its "natural" place in a pipeline, processing a SAM file piped directly from the output of the aligner, the SAM file will be read-id grouped but not have been sorted in any order and won't have any SO field in the header (the SAM file will just be in whatever order the input fastq files are in). It is far more dependable, especially when received an error that may indicate a possible sorting issue, to look at the first chunk for alignments in the file to see if they are in fact sorted the way you expect.. In this case, I believe the file is properly sorted due to the small percentage of reads without mates (2.6%), but perhaps some unaligned reads have been removed??

  2. Oops, thanks for pointing this out. Currently, samblaster is only counting the number of duplicates when it is marking them itself. As you used the -a option, the duplicate marking algorithm has been bypassed. But even in this case, samblaster should still count the number of identified duplicates in the file so that accurate statistics can be output. (Note to self: move lines 1116-1118 down about 8 lines and predicate them on the flag state, not on whether or not the pair is added to the hash table.)