Q: Handling UMI-only reads

dhoconno commented 7 years ago

Thanks for making this very useful set of tools. I have a situation that doesn’t seem to be fully addressed in the documentation. We have cDNA PCR amplicons where the cDNA sequences have been tagged with a 3’ UMI that is 8nt in length. I am able to use UMI_extract to annotate the read name with the UMI sequence and remove the UMI from the sequence itself. There are two other tasks that I think the tools can accomplish, but I’m struggling to figure out how:

1) Restrict analysis to N most common UMIs. Say I want to create a whitelist containing the 100 most commonly occurring UMIs in my dataset so I can focus on those that have the best sequence support. Is there a good way to do this?

2) Deduplication to consensus. In a situation where I have 300 reads with the same UMI, I want to deduplicate to a single consensus sequence for all reads containing the same UMI. Ideally, I’d like to exclude UMIs where read support for the consensus is below a certain threshold (say, 30 of the 300 reads do not match the consensus).

Because our cDNA amplicons are short, we expect most authentic reads to be identical but as the number of reads per UMI increase, the likelihood of having one or more read that are mismatched relative to the consensus.

Sorry if this should be clear from the documentation, but I tried to use the tools to accomplish these tasks but struggled.

Thanks in advance for any advice you can provide,

Dave

TomSmithCGAT commented 7 years ago

Hi @dhoconno. You're right that your analysis isn't fully covered in the documentation. However, we have at least two other users using UMI-tools to derive consensus sequences so we are considering adding this functionality into UMI-tools (see #181). For the time-being you will need to perform a bespoke analysis.

In answer to your questions:

We don't currently support this but you should be able to do this easily yourself if you take the approach outlined below
In order for UMI-tools to identify the duplicate reads, the reads need to be aligned to the reference sequence. After umi_tools extract, you can either align to the whole genome, which is probably overkill in this case, or build a fasta containing the reference sequence(s) and align to this. Depending on how much you expect your consensus to vary from the reference sequence, you may need to tweak the options for the aligner to allow a higher error rate. You can then use umi_tools group to identify the duplicate reads and output this as a BAM (--output-bam) and/or flatfile (--group-out) for further processing however you wish. How you perform this final step will be up to you but this should be straightforward to do in python/perl/R/bash etc. At this stage, you could also implement a filter to only consider the top 100 UMIs as you suggest in 1.

Let me know if any of that isn't clearly explained.

IanSudbery commented 7 years ago

Just to add to this: You appear to want to filter the reads the to top 100M to only select those that are "well supported". This is similar to the "percentile" method we refer to in the publication. We find there that this is not a good way to distinguish UMIs sequences that genuinely existed as oligos added to your DNA molecules from those that arise through various errors and inaccuracies in the amplification and sequencing process. Our directional method performs much better at identifying the groups of reads originating from the same molecule.

dhoconno commented 7 years ago

Thanks Ian, this is really helpful. I’ll try using umi_tools group and bamtools to filter and interact with mapped reads. I’m not sure how others would do this, but I will likely collect all reads with the same UMI and use a clustering tool (likely UCLUST or clumpify) to derive a consensus sequence for each UMI. Doing this with another tool isn’t problematic at all; I just wanted to make sure the functionality didn’t already exist from within umi_tools.

Thanks again for your help,

Dave

CGATOxford / UMI-tools

Q: Handling UMI-only reads #194