broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
976 stars 369 forks source link

MarkDuplicates misses duplicates of read pairs with mapping quality 0 on one end. #1285

Open yfarjoun opened 5 years ago

yfarjoun commented 5 years ago

When a chimeric read pair has one end which aligns well and the other aligns with MappingQuality ==0 the algorithm breaks down and fails to mark duplicates of this template. This is because the aligner will randomly place read2 (the one with MQ0) and since both reads are marked as Mapped, MD considers the position of both reads.

I propose changing this behavior so that a read which is mapped with MQ0 is considered unmapped for the purposes of marking duplicates. This will resolve the problem.

Do folks see an issue with this? Do some aligners use MQ0 to mean something other than "read has more than one equivalent alignment"? Is there value in keeping these (likely) duplicates-reads marked as non-duplicates?

@tfenne @nh13 @lbergelson @jamesemery I would like to hear your opinions on this.

tfenne commented 5 years ago

@yfarjoun Where do you draw the line on this? What if the mate mapping quality is low enough that one sequencing error could throw it to another position in the genome?

I would edit your title to say "duplicates of ... read pairs mapping quality 0 on one end". This can just as easily happy when e.g. one end of a pair is in unique sequence and the other end is in a small tandem duplication or locally-repetitive region within the expected insert size.

Are you unconcerned about pairs where both ends are mapq=0 because the use case you're looking at will ignore those reads? Because the same problem arises there?

In general I see the problem you're raising, and don't disagree that it should be addressed, but I think the scope is broader than you're suggesting.

yfarjoun commented 5 years ago

I agree that the problem is perhaps greater than I am tackling in this issue...(changed the name to MQ0 , thanks)

However...I think that the incremental model is appropriate. We deal with the problems that actually affect us and our work. as we are sequencing more and more FFPE samples and they have really bad chimeras compounded with PCR duplication, the mapped reads are not being marked as duplicates due to their MQ0 mates. I have not seen this issue with MQ=3 or other low value MQ, and so am suggesting to use MQ0 as a way to eliminate the large part of the problem.

As you say, I am unconcerned by both mates having MQ0 since they will be filtered out and will not cause downstream problems. Just as we do not worry about marking completely unmapped pairs (where both reads are unmapped)...I'm sure that there are many duplicates of that form.

I think that it's relatively straightforward to resolve the single-ended MQ0 problem, while it will be more difficult to resolve the low MQ and the both-sides-MQ. So I propose to fix the lower-effort, higher-value part of the larger problem.

yfarjoun commented 5 years ago

I just realized that this commit https://github.com/broadinstitute/picard/commit/e46438dc3296b2e7365d8c180e6fd5aa3e647f2c by @nh13 (on a branch) is the beginning of an implementation for this.

Since I'd like to fix this issue, I'd like to know if there's a reason you didn't submit a PR for that branch, @nh13

nh13 commented 5 years ago

It was four years ago. I don’t remember what I had for lunch last week. Sorry.

yfarjoun commented 5 years ago

relevant issue from 4 years ago: https://github.com/broadinstitute/picard/issues/128