Closed kvn95ss closed 7 months ago
Personally, I feel that including reads that are not linked to a UMI defeats the purpose of using a UMI. I don't remember the details for SMART-Seq3, but are UMIs attached before or after fragmentation? If the latter, then marking duplicates isn't going to be very useful.
I believe they are added after fragmentation, followed by amplification.
I would agree with you, but we had only a tiny amount of RNA to begin with, so we would not like to lose any information from those reads.
I also had another question: using the string method treats the internal reads as if they carried UMIs as well, i.e. it trims the beginning of each read using the --bc-pattern. While this is 'wrong', we observed higher gene counts with this method as more reads were being retained, but can I assume this will not cause a sensible deduplication of these reads?
but can I assume this will not cause a sensible deduplication of these reads?
No, deduplication here will be entirely random.
If the UMIs are added after fragmentation, then deduplicating on position (such as with picard) will not be entirely random. But I can't speak to what will happen to quantification accuracy if you add two sets of reads, deduplicated in different ways, together.
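To make the difference between the two strategies concrete, here is a minimal Python sketch, not what UMI-tools or Picard actually do internally. The read dictionaries and both function names are hypothetical, and UMIs are compared by exact match only, whereas real tools also account for sequencing errors in the UMI.

```python
def dedup_by_position(reads):
    """Keep the first read per (contig, start, strand); UMIs are ignored."""
    seen = {}
    for read in reads:
        key = (read["contig"], read["start"], read["strand"])
        seen.setdefault(key, read)
    return list(seen.values())


def dedup_by_umi(reads):
    """Keep the first read per (contig, start, strand, UMI), exact UMI match only."""
    seen = {}
    for read in reads:
        key = (read["contig"], read["start"], read["strand"], read["umi"])
        seen.setdefault(key, read)
    return list(seen.values())


# Hypothetical example: four reads mapping to the same position,
# carrying three distinct UMIs (one UMI is seen twice).
example_reads = [
    {"name": "r1", "contig": "chr1", "start": 100, "strand": "+", "umi": "AAAA"},
    {"name": "r2", "contig": "chr1", "start": 100, "strand": "+", "umi": "AAAA"},
    {"name": "r3", "contig": "chr1", "start": 100, "strand": "+", "umi": "CCCC"},
    {"name": "r4", "contig": "chr1", "start": 100, "strand": "+", "umi": "GGGG"},
]

by_position = dedup_by_position(example_reads)
by_umi = dedup_by_umi(example_reads)
print(len(by_position), len(by_umi))
```

In this toy case the positional approach collapses all four reads into one, while the UMI-aware approach keeps three. That gap is why counts from reads deduplicated in the two different ways may not be directly comparable.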
No, deduplication here will be entirely random.
Would that necessarily be a bad thing?
Also, I am trying to process the data both ways, and plan to use QualiMap to check for transcript coverage (Smart-seq3 is supposed to have somewhat even coverage of transcripts). If there are any deviations I'll post it here.
One correction, the UMIs were added before fragmentation.
I removed the UMIs from reads containing them, used --filtered-out to obtain the internal reads and finally combined the reads together, effectively removing the UMIs from the reads. The coverage across transcripts is reasonably even.
When only looking at reads with a UMI, there is a strong 5' bias (I was told this is due to the UMIs being at the 5' end of the fragments).
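For anyone following along, the split described above can be sketched in plain Python. This is only an illustration of the idea, not the UMI-tools implementation. The tag/UMI/spacer pattern is an assumed Smart-seq3-style layout, and `split_reads` is a hypothetical helper; check the pattern against your own protocol before relying on it.

```python
import re

# ASSUMPTION: an 11 nt tag, an 8 nt UMI, then a GGG spacer at the read start.
# This layout is illustrative only; verify it against your library prep.
UMI_PATTERN = re.compile(r"^(?P<tag>ATTGCGCAATG)(?P<umi>[ACGTN]{8})(?P<spacer>GGG)")


def split_reads(reads):
    """Split (name, sequence, quality) records into UMI-tagged and internal reads.

    Tagged reads have the tag, UMI, and spacer trimmed off and the UMI appended
    to the read name. Reads that do not match the pattern are returned unchanged
    as internal reads.
    """
    tagged, internal = [], []
    for name, seq, qual in reads:
        match = UMI_PATTERN.match(seq)
        if match is None:
            internal.append((name, seq, qual))
            continue
        cut = match.end()
        tagged.append((f"{name}_{match.group('umi')}", seq[cut:], qual[cut:]))
    return tagged, internal


# Hypothetical reads: r1 starts with the assumed tag + UMI + GGG, r2 does not.
reads = [
    ("r1", "ATTGCGCAATGACGTACGTGGGTTTTCCCCAAAA", "I" * 34),
    ("r2", "GATTACAGATTACAGATTACA", "I" * 21),
]
tagged, internal = split_reads(reads)
print(tagged)
print(internal)
```

Keeping the two groups separate until after deduplication leaves the option open to treat them differently, for example UMI-aware deduplication for the tagged reads and positional or no deduplication for the internal ones.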
Hello!
When using the string extract method or the regex method, both assume all reads are tagged with a UMI. However, depending on the technology (which is Smart-seq3 in my case), there are internal reads without any UMIs. What would be the best way to include these reads in the analysis? The internal reads can make up anywhere from 20% to 60% of the reads, so ignoring them seems... wasteful.
Does the below approach work to incorporate internals?
regex
using the --filtered-out option

One problem would be that the internals might not be 'deduplicated' as perfectly as the UMI reads, so in downstream analysis some genes might have inflated counts. Apart from this issue, I can't think of any other downside, but any input is greatly appreciated.