broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/

[Bug] When removing duplicates, MarkDuplicates does not remove unmapped records whose mate is a duplicate #451

Closed. sooheelee closed this issue 5 years ago.

sooheelee commented 8 years ago

When the mapped mate of a singly mapping pair is marked as duplicate, its unmapped mate is not marked as duplicate. This is expected behavior, as MarkDuplicates excludes unmapped reads from consideration.

If, however, I add REMOVE_DUPLICATES=true to the MarkDuplicates command, the singly mapped duplicate records are removed but their unmapped mates remain in the resulting file. This is undesirable behavior; these unmapped mates should also be removed.

In 6747_snippet_piped_markduplicates.bam, in which duplicates are marked according to Tutorial#6747 using MarkDuplicates, I have 255 singly mapping reads that are also marked as duplicates (SAM flag 1033) and no unmapped reads marked as duplicates (SAM flag 1029).

-bash-3.2$ samtools view -c -f 1033 6747_snippet_piped_markduplicates.bam
255
-bash-3.2$ samtools view -c -f 1029 6747_snippet_piped_markduplicates.bam
0
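
For reference, the two flag values decompose as follows; samtools flags can be used to confirm this, and the breakdown in the comments below is spelled out by hand:

samtools flags 1033    # 0x409 = 1 (paired) + 8 (mate unmapped) + 1024 (duplicate)
samtools flags 1029    # 0x405 = 1 (paired) + 4 (read unmapped) + 1024 (duplicate)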

Using the following command, I get a list of read names for duplicate singly mapping records:

-bash-3.2$ samtools view -f 1033 6747_snippet_piped_markduplicates.bam | cut -f1 | sort | uniq
H0164ALXX140820:2:1101:1944:34606
H0164ALXX140820:2:1101:20356:2293
. [truncated]
H0164ALXX140820:2:2224:31389:66479
H0164ALXX140820:2:2224:31825:66180
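
As a sanity check, the same pipeline can be terminated with a line count; it should report 255, one name per duplicate singly mapping record:

samtools view -f 1033 6747_snippet_piped_markduplicates.bam | cut -f1 | sort -u | wc -l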

Let's use the last read name, H0164ALXX140820:2:2224:31825:66180, as the case in point. This is what the pair looks like in 6747_snippet_piped_markduplicates.bam:

H0164ALXX140820:2:2224:31825:66180  1113    10  96663263    60  72S79M  =   96663263    0   TGTTGTCGTTGCGTGTTTGGGGGGTGGGTTGGTTTTGCCGTTAGGTAGGGTGAGTGGGTCGGAAGTGCTGACAAACAGCAAGAGTAATATAGTTAATACACTCCCCTGAAAACCTGCTAAAAGAGTAGATTTTAGGTGCTTATACCACGAA #############################################################################AAJ-FAJFA<7F7JA-JA---J-J<-JJ-JFF-A-FJJ-JAFJF--FJF<FFJJAAFFJFJJJJJJFJJFFAFA MD:Z:30A48  PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:1  UQ:i:12 AS:i:74 XS:i:36
H0164ALXX140820:2:2224:31825:66180  165 10  96663263    0   *   =   96663263    0   ACCACCAAACCAAAAAACGACACAACCCCCCCCACACCAACACCAAAAAAACAACCACCAAACAAAACCAACACACGAAGACACGACAAGACAGGGCAGTAAAGACAGCAAGTGTAAAGCAACGGAACAGCACCAGACAGAGTAGTGAAGT AAA---<<7-AF---<F7---<---JA-F-F<<-FF<<<F############################################################################################################### MC:Z:72S79M PG:Z:MarkDuplicates RG:Z:H0164.2
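
Decoding the flags of this pair (for example with samtools flags) makes the asymmetry explicit: the mapped record carries the 0x400 duplicate bit, while its unmapped mate does not:

samtools flags 1113    # 0x459 = paired, mate unmapped, reverse strand, first in pair, duplicate
samtools flags 165     # 0x0A5 = paired, read unmapped, mate reverse strand, second in pair (no duplicate bit)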

Here is the command used to remove the duplicates, followed by the results of grepping for two of the read names. As you can see, the unmapped mate of each duplicate singly mapped read remains.

-bash-3.2$ java -jar $PICARD MarkDuplicates INPUT=6483_snippet_piped.bam OUTPUT=snippet_removeduplicates_markduplicates.bam METRICS_FILE=snippet_removeduplicates_markduplicates_metrics.txt OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true CREATE_INDEX=true
-bash-3.2$ samtools view snippet_removeduplicates_markduplicates.bam | grep 'H0164ALXX140820:2:2224:31825:66180'
H0164ALXX140820:2:2224:31825:66180  165 10  96663263    0   *   =   96663263    0   ACCACCAAACCAAAAAACGACACAACCCCCCCCACACCAACACCAAAAAAACAACCACCAAACAAAACCAACACACGAAGACACGACAAGACAGGGCAGTAAAGACAGCAAGTGTAAAGCAACGGAACAGCACCAGACAGAGTAGTGAAGT AAA---<<7-AF---<F7---<---JA-F-F<<-FF<<<F############################################################################################################### MC:Z:72S79M PG:Z:MarkDuplicates RG:Z:H0164.2
-bash-3.2$ samtools view snippet_removeduplicates_markduplicates.bam | grep 'H0164ALXX140820:2:2224:31389:66479'
H0164ALXX140820:2:2224:31389:66479  165 10  96663251    0   *   =   96663251    0   ATCTTGAAACTTAACAAGAAGATACTGCCGTTTGCATCAGAATGGAATGAAAATGGAGGATCATATGATAAGTGAAATAACCACGACAATGATGAAAACGCCTCGAACCATAACAACGCAGAAACGAATCATAAAAAGAATAAACAAGAGG <--<--AF<--AFA-J7J-J-A<A------F-AF--A-A--7-J-7-<-FA-<--F-J-7----A-<A-7<A-JA--A-<<-A-F################################################################## MC:Z:60S91M PG:Z:MarkDuplicates RG:Z:H0164.2
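
As a further check, counting duplicate-flagged reads in the output should return 0, confirming that the flagged records themselves were removed even though their orphaned unmapped mates remain:

samtools view -c -f 1024 snippet_removeduplicates_markduplicates.bam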

In fact, a simple count shows a difference of 255 reads in the mate-unmapped category (flag 8) but no difference in the unmapped-reads category (flag 4):

-bash-3.2$ samtools view -c -f 4 snippet_removeduplicates_markduplicates.bam
1211
-bash-3.2$ samtools view -c -f 8 snippet_removeduplicates_markduplicates.bam
2380
-bash-3.2$ samtools view -c -f 4 6483_snippet_piped.bam
1211
-bash-3.2$ samtools view -c -f 8 6483_snippet_piped.bam
2635
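
The arithmetic matches: 2635 - 2380 = 255 fewer mate-unmapped records (flag 8) after removal, exactly the number of duplicate singly mapping reads, while the unmapped-read count (flag 4) stays at 1211 because the orphaned unmapped mates are kept.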

This is undesirable, or at least inconsistent, behavior. This bug report is based on my own testing.

sooheelee commented 8 years ago

To follow up, the resulting file does not validate.

-bash-3.2$ java -jar $PICARD ValidateSamFile INPUT=snippet_removeduplicates_markduplicates.bam MODE=SUMMARY
[Thu Feb 18 11:40:32 EST 2016] picard.sam.ValidateSamFile INPUT=snippet_removeduplicates_markduplicates.bam MODE=SUMMARY    MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Feb 18 11:40:32 EST 2016] Executing as shlee@gsa3.broadinstitute.org on Linux 2.6.18-371.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132; Picard version: 1.1010(a4d54b4cc728030c92afe147fb013f1f2a9a0a39_1455717603) IntelDeflater

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:MATE_NOT_FOUND    255

[Thu Feb 18 11:40:39 EST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.11 minutes.
Runtime.totalMemory()=2352480256
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
sooheelee commented 8 years ago

In my opinion this is a minor bug, given that my recent tutorial on MarkDuplicates recommends keeping all reads and not removing duplicates, consistent with a lossless operating procedure.
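
With the lossless approach, duplicates stay in the file and are excluded at analysis time instead, for example with a flag filter like the sketch below (the output filename is illustrative). Note that this is subject to the same caveat: the unmapped mates are not flagged, so they are not excluded by the filter either.

samtools view -b -F 1024 6747_snippet_piped_markduplicates.bam > snippet_no_duplicates.bam    # drop reads with the 0x400 duplicate bit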

yfarjoun commented 6 years ago

I think that this is no longer the case when the input file is queryname sorted. @sooheelee, can you please check whether that is good enough?
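
A minimal way to test that, assuming a current Picard build and reusing the same snippet (output filenames here are illustrative): queryname-sort the input, rerun MarkDuplicates with REMOVE_DUPLICATES=true, and check whether the unmapped-read count drops by 255 and ValidateSamFile no longer reports MATE_NOT_FOUND.

java -jar $PICARD SortSam INPUT=6483_snippet_piped.bam OUTPUT=6483_snippet_queryname.bam SORT_ORDER=queryname
java -jar $PICARD MarkDuplicates INPUT=6483_snippet_queryname.bam OUTPUT=snippet_qn_removedups.bam METRICS_FILE=snippet_qn_metrics.txt REMOVE_DUPLICATES=true
samtools view -c -f 4 snippet_qn_removedups.bam    # expect 1211 - 255 = 956 if the unmapped mates of duplicates are now removed too
java -jar $PICARD ValidateSamFile INPUT=snippet_qn_removedups.bam MODE=SUMMARY    # expect no MATE_NOT_FOUND errors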