Removing PCR duplicates

b-dawes commented 7 years ago

Hello,

My lab is in the process of testing out Salmon and potentially switching to it from traditional aligners. With a traditional aligner, we use picard's MarkDuplicates to remove PCR duplicates. Is there a way to remove PCR duplicates with Salmon? I tried using --writeMappings to generate a BAM and feed that into picard, but I just get a SAM validation error: "Not primary alignment flag should not be set for unmapped read."

Best, Brian

rob-p commented 7 years ago

Hi Brian,

In general, I would argue that one should be cautious with removing PCR duplicates in RNA-seq data (unless you are dealing with reads with UMI tags). This is because reads that align to the same reference position can easily have come from alternative transcripts sharing the same underlying sequence. Hence, the normal tests used to infer PCR duplicates with e.g. DNA-seq reads can yield false-positives in RNA-seq. This is particularly true for highly abundant transcripts (or transcripts from highly-abundant genes).

We are currently working on the code that will do duplicate removal when UMI tags are present. That methodology can be extended to remove duplicates even without UMI tags --- though I'd generally caution against that for the reasons mentioned above. However, for the time being, if you have a strong need or desire to filter PCR duplicates, you could use alignment-based Salmon with a BAM file that has duplicates removed.

Finally, regarding the error you are getting during SAM validation; this sounds like a different issue. Would you mind providing a piece of that SAM file for me to take a look at? Specifically, I don't believe the quasi-mapping output file should even contain unmapped reads (unless you consider unmapped mates of orphaned reads).

--Rob

roryk commented 7 years ago

+1 on duplicate marking for RNA-seq data being counter productive. Picard can be pretty particular about the format of the SAM file, if you really want to mark the duplicates, sambamba is a lot faster and should complain less:

http://lomereiter.github.io/sambamba/docs/sambamba-markdup.html

b-dawes commented 7 years ago

Thanks for the responses.

Here's the records for the first read in my bam:

NB501285:81:HVJFGBGX2:1:11101:7287:1044 105     uc008cdk.2|C4b  2845    1       36M     =       2845    0       AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 149     uc008cdk.2|C4b  2845    0       *       =       2845    0       ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 361     uc008cdl.2|C4b  2845    1       36M     =       2845    0       AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 405     uc008cdl.2|C4b  2845    0       *       =       2845    0       ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 361     uc008cdp.1|C4a  2814    1       36M     =       2814    0       AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA    *       NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 405     uc008cdp.1|C4a  2814    0       *       =       2814    0       ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN    *       NH:i:3

Picard crashes on the fourth record with error:

ERROR: Record 4, Read name NB501285:81:HVJFGBGX2:1:11101:7287:1044, Not primary alignment flag should not be set for unmapped read.

It looks like the issue is that Salmon is setting bit 4 (read unmapped) for the mates and is setting bit 256 (not primary alignment) for the secondary alignments. Both bits 4 and 256 are set for the mate of a secondary alignment which picard doesn't like.

mmokrejs commented 6 years ago

@b-dawes Did you bring it to picard developers? I do not say salmon is not broken, but ... getting more opinions woul dbe welcome. @droazen Clues?

b-dawes commented 6 years ago

Hi @mmokrejs

I never did bring it up to the picard developers.

antgomo commented 5 years ago

Hi Brian,

In general, I would argue that one should be cautious with removing PCR duplicates in RNA-seq data (unless you are dealing with reads with UMI tags). This is because reads that align to the same reference position can easily have come from alternative transcripts sharing the same underlying sequence. Hence, the normal tests used to infer PCR duplicates with e.g. DNA-seq reads can yield false-positives in RNA-seq. This is particularly true for highly abundant transcripts (or transcripts from highly-abundant genes).

We are currently working on the code that will do duplicate removal when UMI tags are present. That methodology can be extended to remove duplicates even without UMI tags --- though I'd generally caution against that for the reasons mentioned above. However, for the time being, if you have a strong need or desire to filter PCR duplicates, you could use alignment-based Salmon with a BAM file that has duplicates removed.

Finally, regarding the error you are getting during SAM validation; this sounds like a different issue. Would you mind providing a piece of that SAM file for me to take a look at? Specifically, I don't believe the quasi-mapping output file should even contain unmapped reads (unless you consider unmapped mates of orphaned reads).

--Rob

It is in the latest Salmon release?

Thanks

stevennevins commented 5 years ago

I'm curious if there is code available to remove the duplicates based on UMI tags for RNA. I'm aware of Alevin, but I don't think this suits my purpose

Regards, Steven

ChenfuShi commented 4 years ago

Hi, is there any news on removing PCR duplicates in salmon? I think this is a major fallacy of this software when using paired end reads or reads with UMIs. Sometimes PCR duplication can be quite different between replicates and with paired end read it should be quite widely accepted that those are real PCR duplicates and should be removed.

Thanks!

COMBINE-lab / salmon

Removing PCR duplicates #136