achamess closed this issue 7 years ago
From what I understand, you have a UMI on read1 and the mouse sequence on read2? If that's the case, then yes, the best way to proceed is to extract the UMI from read1 and add it to the name of read2.
`umi_tools extract` will do this if you add the `--read2-in` and `--read2-out` options to the call. By default, it is assumed that the UMI is only found on read1. Read1, trimmed of the UMI, will be output, but you can just discard this and only map read2, which should now have the UMI as part of its read name.
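As a sketch (file names are placeholders, and the pattern assumes an 8bp UMI at the 5' end of read1):

```shell
# Extract an assumed 8bp UMI from the start of read1 and append it
# to the names of both reads; all file names are placeholders.
umi_tools extract \
    --bc-pattern=NNNNNNNN \
    -I read1.fastq.gz \
    --read2-in=read2.fastq.gz \
    -S read1.extracted.fastq.gz \
    --read2-out=read2.extracted.fastq.gz
```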
@IanSudbery Thanks for the quick response! That's promising. Glad to hear this should work out. However, R1 also has an important barcode seq contiguous to the UMI. That information is critical too for one more round of demultiplexing (cellular index = i5 + i7 + the R1 barcode). Can I also attach that along with the UMI? Any thoughts on tools to demultiplex based on that last R1 barcode?
If you want to demultiplex to separate files, here is what I would do:

1. Extract UMIs using UMI-tools. If you get the pattern right, the UMI should be removed and added to the R1 and R2 names, but the barcode left in place on read1.
2. Use a paired-end aware demultiplexer to demultiplex files based on the barcode - I've used reaper to do this before (http://www.ebi.ac.uk/research/enright/software/kraken).
3. Discard whatever remains of read1 and map only read2.
4. De-duplicate with UMI-tools.
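A rough end-to-end sketch of those steps (file names, the UMI/barcode pattern, and the reaper and STAR steps are all assumptions to adapt to your own data):

```shell
# 1. Extract the 8bp UMI (N) into the read names; X bases are kept in
#    place on read1, so the 10bp barcode survives for demultiplexing.
umi_tools extract --bc-pattern=NNNNNNNNXXXXXXXXXX \
    -I R1.fastq.gz --read2-in=R2.fastq.gz \
    -S R1.umi.fastq.gz --read2-out=R2.umi.fastq.gz

# 2. Demultiplex on the barcode now at the start of read1 with a
#    paired-end aware tool (e.g. reaper; options depend on your barcodes).

# 3. Discard read1 and map only read2, e.g. with STAR.

# 4. De-duplicate the sorted, indexed BAM using the UMI in the read name.
umi_tools dedup -I mapped.sorted.bam -S deduped.bam
```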
Thanks @IanSudbery! Sounds like a plan. I’ll give it a try.
Hey. I wanted to follow up. In the end, I made a pipeline of the following:

1. `extract` using default parameters and my UMI pattern
2. Reaper to demultiplex based on the RT barcode
3. `dedup` to remove duplicates

So far so good. Thanks for making UMI-tools. It was the right tool at exactly the right moment for me. I lose a ton of reads (60-80% are duplicates), but that's not the tool's fault. That's just the reality of single cell.
Glad you found a way to make it work for you!
Hi again. Do you have any thoughts on the best way to handle data from multiple lanes? For this sample, we have run two Nextseq lanes from the same library pool, so these are technical replicates of each other. I know that the data need to be merged at some point. Right now, for each lane's data, I did parallel steps: QC -> UMI extract -> STAR. With the output BAM files, I then merged them using samtools before running `umi_tools dedup`. It seems important to merge before dedup so that the dedup tool recognizes duplicates that are present on both lanes. Are there any parameter changes I should make to the tool? Is there a superior approach?
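For what it's worth, a minimal sketch of that order of operations (file names are placeholders):

```shell
# Merge the per-lane BAMs first, then sort and index, then dedup once,
# so duplicates present on both lanes are collapsed together.
samtools merge merged.bam lane1.bam lane2.bam
samtools sort -o merged.sorted.bam merged.bam
samtools index merged.sorted.bam
umi_tools dedup -I merged.sorted.bam -S merged.dedup.bam
```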
It's definitely important to merge the BAMs before you dedup. The reads should retain their read groups through UMI-tools, so if you wanted to examine lane-specific effects, you should add read groups to the BAMs before merging, and then you could look at things like read-group-specific counts afterwards if that concerns you.
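Something like this, with placeholder read-group and sample IDs:

```shell
# Tag each lane's reads with a read group before merging so that
# lane-specific effects can still be examined after dedup.
samtools addreplacerg -r ID:lane1 -r SM:sample1 -o lane1.rg.bam lane1.bam
samtools addreplacerg -r ID:lane2 -r SM:sample1 -o lane2.rg.bam lane2.bam
samtools merge merged.bam lane1.rg.bam lane2.rg.bam
```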
Did you get this working?
Hi Ian, Yes, it worked. Or at least I think it did. The major conclusion from this was that there are very diminishing returns for deeper sequencing in our case. We found that 60-80% of our reads are removed as duplicates using the default UMI counts settings. That's a ton, but probably accurate. Just the nature of single cell stuff I think. Do you have any thoughts about using the other settings? We're doing single-nucleus RNA-seq using mouse neurons. Median UMIs are about 3000 when all is said and done.
Do you mean that the median UMI is present in 3000 copies, or that the median gene has 3000 UMIs?
The only settings that are really going to make a difference for you are the method and the edit distance calculation: at very highly saturated positions (i.e. nearly all the UMIs are used), there are some minor benefits to using the adjacency method over the directional method. Of course, this will be offset by poorer performance at less saturated positions.
It is also possible, with very long UMIs, that you get two mutations in a UMI simultaneously, making a 2nt edit distance more suitable. But we only really find this with very long UMIs, and definitely not with 8mers.
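For reference, those options map onto `dedup` flags roughly as follows (BAM names are placeholders):

```shell
# Default: directional method, 1 mismatch allowed between UMIs.
umi_tools dedup -I in.bam -S out.directional.bam --method=directional

# Adjacency method, which may help at very highly saturated positions.
umi_tools dedup -I in.bam -S out.adjacency.bam --method=adjacency

# A 2nt edit distance; only really worth considering for UMIs much
# longer than 8mers.
umi_tools dedup -I in.bam -S out.ed2.bam --edit-distance-threshold=2
```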
Hi, I mean that the median UMI count for each cell is 3000. That's after dedup. I don't doubt that; it's probably right. I was just surprised by how much gets removed between pre- and post-dedup.
Hi @achamess. Just to let you know that v0.5 is now out, including a guide on using UMI-tools for scRNA-Seq: https://github.com/CGATOxford/UMI-tools/blob/master/doc/Single_cell_tutorial.md. I'll close this issue now but feel free to re-open if you have any more questions for your analysis
Hi. Thanks for creating this tool. I'm trying to do an analysis from a pre-print on some scRNA-seq data and I think UMI Tools might be appropriate.
Here is the paper: http://biorxiv.org/content/early/2017/02/02/104844
Basically, there is a P5 adaptor, i5 index, and UMI (8 bp) + barcode (10 bp) in the R1 file (18 bp), and then from the reverse side a P7 adaptor and the actual cDNA insert read (52 bp).
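For what it's worth, one way this layout could be encoded in an `extract` pattern (N marks UMI bases; C marks cell-barcode bases extracted into the read name, or use X instead of C if a downstream demultiplexer needs the barcode left in the sequence; file names are placeholders):

```shell
# Assumed R1 layout: 8bp UMI followed by a 10bp cell barcode (18bp total).
umi_tools extract --bc-pattern=NNNNNNNNCCCCCCCCCC \
    -I R1.fastq.gz --read2-in=R2.fastq.gz \
    -S R1.extracted.fastq.gz --read2-out=R2.extracted.fastq.gz
```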
We ran a Nextseq 500 75 BP PE run, and used bcl2fastq to demultiplex, and we see the expected i5/i7 combos. So we have 120 fastq files that have been demultiplexed once. Now we need to do the remainder of the pipeline from the paper:
Could I use UMI-tools here? The R1 read has no sequence info that will align to the mouse genome (it's all synthetic), so it would make sense to consolidate and add that info somehow to the informative R2. I'm at a loss as to which tools/pipeline I should use to proceed. Any guidance would be great, and if UMI-tools is not the right choice, what should I use?
Thanks so much.