Open b-dawes opened 7 years ago
Hi Brian,
In general, I would argue that one should be cautious with removing PCR duplicates in RNA-seq data (unless you are dealing with reads with UMI tags). This is because reads that align to the same reference position can easily have come from alternative transcripts sharing the same underlying sequence. Hence, the normal tests used to infer PCR duplicates with e.g. DNA-seq reads can yield false-positives in RNA-seq. This is particularly true for highly abundant transcripts (or transcripts from highly-abundant genes).
We are currently working on the code that will do duplicate removal when UMI tags are present. That methodology can be extended to remove duplicates even without UMI tags --- though I'd generally caution against that for the reasons mentioned above. However, for the time being, if you have a strong need or desire to filter PCR duplicates, you could use alignment-based Salmon with a BAM file that has duplicates removed.
Finally, regarding the error you are getting during SAM validation; this sounds like a different issue. Would you mind providing a piece of that SAM file for me to take a look at? Specifically, I don't believe the quasi-mapping output file should even contain unmapped reads (unless you consider unmapped mates of orphaned reads).
--Rob
+1 on duplicate marking for RNA-seq data being counter productive. Picard can be pretty particular about the format of the SAM file, if you really want to mark the duplicates, sambamba is a lot faster and should complain less:
http://lomereiter.github.io/sambamba/docs/sambamba-markdup.html
Thanks for the responses.
Here's the records for the first read in my bam:
NB501285:81:HVJFGBGX2:1:11101:7287:1044 105 uc008cdk.2|C4b 2845 1 36M = 2845 0 AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA * NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 149 uc008cdk.2|C4b 2845 0 * = 2845 0 ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN * NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 361 uc008cdl.2|C4b 2845 1 36M = 2845 0 AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA * NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 405 uc008cdl.2|C4b 2845 0 * = 2845 0 ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN * NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 361 uc008cdp.1|C4a 2814 1 36M = 2814 0 AGATTCTCCAAATTGAGAAGGAAGGAGCCATCCACA * NH:i:3
NB501285:81:HVJFGBGX2:1:11101:7287:1044 405 uc008cdp.1|C4a 2814 0 * = 2814 0 ACCCTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN * NH:i:3
Picard crashes on the fourth record with error:
ERROR: Record 4, Read name NB501285:81:HVJFGBGX2:1:11101:7287:1044, Not primary alignment flag should not be set for unmapped read.
It looks like the issue is that Salmon is setting bit 4 (read unmapped) for the mates and is setting bit 256 (not primary alignment) for the secondary alignments. Both bits 4 and 256 are set for the mate of a secondary alignment which picard doesn't like.
@b-dawes Did you bring it to picard developers? I do not say salmon is not broken, but ... getting more opinions woul dbe welcome. @droazen Clues?
Hi @mmokrejs
I never did bring it up to the picard developers.
Hi Brian,
In general, I would argue that one should be cautious with removing PCR duplicates in RNA-seq data (unless you are dealing with reads with UMI tags). This is because reads that align to the same reference position can easily have come from alternative transcripts sharing the same underlying sequence. Hence, the normal tests used to infer PCR duplicates with e.g. DNA-seq reads can yield false-positives in RNA-seq. This is particularly true for highly abundant transcripts (or transcripts from highly-abundant genes).
We are currently working on the code that will do duplicate removal when UMI tags are present. That methodology can be extended to remove duplicates even without UMI tags --- though I'd generally caution against that for the reasons mentioned above. However, for the time being, if you have a strong need or desire to filter PCR duplicates, you could use alignment-based Salmon with a BAM file that has duplicates removed.
Finally, regarding the error you are getting during SAM validation; this sounds like a different issue. Would you mind providing a piece of that SAM file for me to take a look at? Specifically, I don't believe the quasi-mapping output file should even contain unmapped reads (unless you consider unmapped mates of orphaned reads).
--Rob
It is in the latest Salmon release?
Thanks
I'm curious if there is code available to remove the duplicates based on UMI tags for RNA. I'm aware of Alevin, but I don't think this suits my purpose
Regards, Steven
Hi, is there any news on removing PCR duplicates in salmon? I think this is a major fallacy of this software when using paired end reads or reads with UMIs. Sometimes PCR duplication can be quite different between replicates and with paired end read it should be quite widely accepted that those are real PCR duplicates and should be removed.
Thanks!
Hello,
My lab is in the process of testing out Salmon and potentially switching to it from traditional aligners. With a traditional aligner, we use picard's MarkDuplicates to remove PCR duplicates. Is there a way to remove PCR duplicates with Salmon? I tried using --writeMappings to generate a BAM and feed that into picard, but I just get a SAM validation error: "Not primary alignment flag should not be set for unmapped read."
Best, Brian