Closed sooheelee closed 5 years ago
To follow up, the resulting file does not validate.
-bash-3.2$ java -jar $PICARD ValidateSamFile INPUT=snippet_removeduplicates_markduplicates.bam MODE=SUMMARY
[Thu Feb 18 11:40:32 EST 2016] picard.sam.ValidateSamFile INPUT=snippet_removeduplicates_markduplicates.bam MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Feb 18 11:40:32 EST 2016] Executing as shlee@gsa3.broadinstitute.org on Linux 2.6.18-371.el5 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132; Picard version: 1.1010(a4d54b4cc728030c92afe147fb013f1f2a9a0a39_1455717603) IntelDeflater
## HISTOGRAM java.lang.String
Error Type Count
ERROR:MATE_NOT_FOUND 255
[Thu Feb 18 11:40:39 EST 2016] picard.sam.ValidateSamFile done. Elapsed time: 0.11 minutes.
Runtime.totalMemory()=2352480256
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
This is in my opinion a minor bug given that my recent tutorial on MarkDuplicates recommends keeping all reads and not removing duplicates consistent with a lossless operating procedure.
I think that this is no longer the case when the input file is queryname sorted. @sooheelee can you please check to see if that is good enough?
When the mapped mate of a singly mapping pair is marked as duplicate, its unmapped mate is not marked as duplicate. This is expected behavior as MarkDuplicates ignores unmapped reads from consideration.
If, however, I add 'REMOVE_DUPLICATES'=true to the MarkDuplicates command, the singly mapped duplicates are removed but the unmapped mates remain in the resulting file. This is undesirable behavior and these unmapped mates should also be removed.
In
6747_snippet_piped_markduplicates.bam
, in which duplicates are marked according to Tutorial#6747 using MarkDuplicates, I have 255 singly mapping reads that are also duplicate (SAM flag1033
) and no unmapped reads marked as duplicate (SAM flag1029
).Using the following command, I get a list of read names for duplicate singly mapping records:
Let's use the last record
H0164ALXX140820:2:2224:31825:66180
as the case in point. This is what the record looks like in6747_snippet_piped_markduplicates.bam
:Here's the example command to remove the duplicates and results from grepping for two of the records. As you can see, the unmapped mate of the duplicate singly mapped read remains.
In fact, doing a simple count shows a difference of 255 reads in the mate-unmapped category but no difference in unmapped reads category:
This is undesirable behavior, or at least inconsistent behavior. This bug report is from my personal testing.