CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
480 stars 190 forks source link

UMI Tools for sci-RNA-seq analysis #102

Closed achamess closed 7 years ago

achamess commented 7 years ago

Hi. Thanks for creating this tool. I'm trying to do an analysis from a pre-print on some scRNA-seq data and I think UMI Tools might be appropriate.

Here is the paper: http://biorxiv.org/content/early/2017/02/02/104844

Basically, there is a P5 adaptor, i5 index, UMI (8)+Barcode (10) in the R1 file (18 bp) and then from the reverse side a P7 adaptor and the actual cDNA insert read (52 bp)

We ran a Nextseq 500 75 BP PE run, and used bcl2fastq to demultiplex, and we see the expected i5/i7 combos. So we have 120 fastq files that have been demultiplexed once. Now we need to do the remainder of the pipeline from the paper:

Base calls were converted to fastq format and demultiplexed using Illumina’s bcl2fastq/ 2.16.0.10 tolerating one mismatched base in barcodes (edit distance (ED) < 2). Demultiplexed reads were then adaptor clipped using trim_galore/0.4.1 with default settings. Trimmed reads were mapped to the human reference genome (hg19), mouse reference genome (mm10), C.elegans reference genome (PRJNA13758) or a chimeric reference genome of hg19, mm10 and PRJNA13758, using STAR/v 2.5.2b (38) with default settings and gene annotations (GENCODE V19 for human; GENCODE VM11 for mouse, WormBase PRJNA13758.WS253.canonical_gene set for C.elegans). Uniquely mapping reads were extracted, and duplicates were removed using the unique molecular identifier (UMI) sequence (ED < 2, including insertions and deletions), reverse transcription (RT) index, and read 2 end- coordinate (i.e. reads with identical UMI, RT index, and tagmentation site were considered duplicates). Finally, mapped reads were split into constituent cellular indices by further demultiplexing reads using the RT index (ED < 2, including insertions and deletions). For mixed-species experiment, the percentage of uniquely mapping reads for genomes of each species was calculated.

Could I use UMI tools here? The R1 read has no sequence info that will align (it's all synthetic) to the mouse genome, so it would make sense to consolidate and add that info somehow to the informative R2. I'm at a loss as to which tools/pipeline I use should use to proceed. Any guidance would be great, and if UMI tools is not the right choice, what would I use?

Thanks so much.

IanSudbery commented 7 years ago

From what I understand of have a UMI on read1 and the mouse sequence on read2? If thats the case, then yes, the best way to proceed is to extract the UMI from read1 and add it to the name of read2.

umi_tools extract will do this if you add --read2-in and --read-out options to the call. By default it is assumed that the UMI is only found on read one. Read1, trimmed of the UMI will be output, but you can just discard this and only map read2, which should now have the UMI as part of its read name.

achamess commented 7 years ago

@IanSudbery Thanks for the quick response! That's promising. Glad to hear this should work out. However, R1 also has an important barcode seq contiguous to the UMI. That information is critical too for one more round of demultiplexing (cellular index = i5 + i7 + the R1 barcode). Can I also attach that along with the UMI? Any thoughts on tools to demultiplex based on that last R1 barcode?

IanSudbery commented 7 years ago

If you want to demultiplex to seperate files here is what I would do:

extract UMIs using UMI-tools. If you get the pattern right, the UMI should be removed and added to R1 and R2 names, but the barcode left in place on read1. Next use a paired end aware demultiplexer to demultiplex files based on the barcode - I've used reaper to do this before ( http://www.ebi.ac.uk/research/enright/software/kraken) Discard whatever remains of read1 and map only read2 De-duplicate with UMI-tools.

On Fri, 7 Apr 2017 at 14:58 achamess notifications@github.com wrote:

@IanSudbery https://github.com/IanSudbery Thanks for the quick response! That's promising. Glad to hear this should work out. However, R1 also has an important barcode seq contiguous to the UMI. That information is critical too for one more round of demultiplexing (cellular index = i5 + i7 + the R1 barcode). Can I also attach that along with the UMI? Any thoughts on tools to demultiplex based on that last R1 barcode?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/102#issuecomment-292543655, or mute the thread https://github.com/notifications/unsubscribe-auth/AFJFjt1wD1CRcykj76yrz6h4oMqBPC1Vks5rtkDmgaJpZM4M26sX .

achamess commented 7 years ago

Thanks @IanSudbery! Sounds like a plan. I’ll give it a try.

On Apr 7, 2017, at 10:06 AM, Ian Sudbery notifications@github.com wrote:

If you want to demultiplex to seperate files here is what I would do:

extract UMIs using UMI-tools. If you get the pattern right, the UMI should be removed and added to R1 and R2 names, but the barcode left in place on read1. Next use a paired end aware demultiplexer to demultiplex files based on the barcode - I've used reaper to do this before ( http://www.ebi.ac.uk/research/enright/software/kraken) Discard whatever remains of read1 and map only read2 De-duplicate with UMI-tools.

On Fri, 7 Apr 2017 at 14:58 achamess notifications@github.com wrote:

@IanSudbery https://github.com/IanSudbery Thanks for the quick response! That's promising. Glad to hear this should work out. However, R1 also has an important barcode seq contiguous to the UMI. That information is critical too for one more round of demultiplexing (cellular index = i5 + i7 + the R1 barcode). Can I also attach that along with the UMI? Any thoughts on tools to demultiplex based on that last R1 barcode?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/102#issuecomment-292543655, or mute the thread https://github.com/notifications/unsubscribe-auth/AFJFjt1wD1CRcykj76yrz6h4oMqBPC1Vks5rtkDmgaJpZM4M26sX .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/102#issuecomment-292545849, or mute the thread https://github.com/notifications/unsubscribe-auth/AHGbcnMgyfNtEiwqt8dX6O4gctp2nBZdks5rtkLfgaJpZM4M26sX.

achamess commented 7 years ago

Hey. I wanted to follow up. In the end, I made a pipeline of the following:

  1. Trimmomatic to do paired-end trimming/QC
  2. Join R1 (barcode + UMI) with R2 (cDNA) and then only work with the merged R1+R2 read in subsequent steps
  3. UMI tools extract using default parameters and my UMI pattern
  4. Reaper to demultiplex based on RT barcode
  5. STAR to align
  6. UMI tools dedup to remove duplicates
  7. HTseq-count to summarize

So far so good. Thanks for making UMI tools. It was the right tool at exactly the right moment for me. I lose a ton of reads (like 60-80% are duplicates), but that not the tool's fault. That's just the reality of single cell.

IanSudbery commented 7 years ago

Glad you found a way to make it work for you!

achamess commented 7 years ago

Hi again. Do you have any thoughts on the best way to handle data from multiple lanes? For this sample, we have run two Nextseq lanes from the same library pool. So these are technical replicates of each other. I know that the data need to be merged at some point. Right now, for each lane's data, I did parallel steps. QC -> UMI extract -> STAR. With the output bam files, I then merged the bam files using samtools before running umi dedup. It seems important to merge before dedup so that the dedup tool recognizes duplicates that are present on both lanes. Are there any parameter changes I should make to the tool? Is there a superior approach?

IanSudbery commented 7 years ago

Its definitely important to merge the BAMs before you dedup.The reads should retain their read groups through UMI-tools, so if you wanted to examine lane specific effects, you should add read groups to the BAMs before merging, and then you could look at things like read group specific counts afterwards if that bothered you.

IanSudbery commented 7 years ago

Did you get this working?

achamess commented 7 years ago

Hi Ian, Yes, it worked. Or at least I think it did. The major conclusion from this was that there are very diminishing returns for deeper sequencing in our case. We found that 60-80% of our reads are removed as duplicates using the default UMI counts settings. That's a ton, but probably accurate. Just the nature of single cell stuff I think. Do you have any thoughts about using the other settings? We're doing single-nucleus RNA-seq using mouse neurons. Median UMIs are about 3000 when all is said and done.

IanSudbery commented 7 years ago

Do you mean that the median UMI is present in 3000 copys or that the median gene has 3000 UMIs?

The only settings that are really going to make a difference for your are the method and the edit distance calculation: at very high saturation positions (i.e. nearly all the UMIs are used), there are some minor benefits to using the adjacency method over the directional method. Of course this will be offset by less good performance at less saturated positions.

It is also possible in very long UMIs that you get two mutations in an UMI simultaneously, thus making a 2nt edit distance more suitable. But we only really find this with very long UMIs, and definitely not with 8mers.

achamess commented 7 years ago

Hi, I mean that the UMI median for each cell is 3000. That's after dedup. I don't doubt that. It's probably right. I was just surprised about how much gets removed pre and post dedup.

Sent from my iPhone

On May 8, 2017, at 5:19 AM, Ian Sudbery notifications@github.com wrote:

Do you mean that the median UMI is present in 3000 copys or that the median gene has 3000 UMIs?

The only settings that are really going to make a difference for your are the method and the edit distance calculation: at very high saturation positions (i.e. nearly all the UMIs are used), there are some minor benefits to using the adjacency method over the directional method. Of course this will be offset by less good performance at less saturated positions.

It is also possible in very long UMIs that you get two mutations in an UMI simultaneously, thus making a 2nt edit distance more suitable. But we only really find this with very long UMIs, and definitely not with 8mers.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

TomSmithCGAT commented 7 years ago

Hi @achamess. Just to let you know that v0.5 is now out, including a guide on using UMI-tools for scRNA-Seq: https://github.com/CGATOxford/UMI-tools/blob/master/doc/Single_cell_tutorial.md. I'll close this issue now but feel free to re-open if you have any more questions for your analysis