CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
474 stars 188 forks source link

Steps for single cell (drop-seq) protocol #97

Closed Hoohm closed 6 years ago

Hoohm commented 7 years ago

Hello, I'm trying to use your tool to handle our single cell drop-seq data. The method is using paired reads but the pairing is not on the gene itself, meaning we're not doing paired end sequencing, but rather having the barcode + UMI on R1 and the sequence on the R2.

As an example: R1: @HISEQ:190:H5VNVBCXY:1:1101:1380:2019 1:N:0:TAAGGCGA NCTGACGCAAGGACAGGGAG +

<<GGGGIIIIGIIGGIAGG

R2: @HISEQ:190:H5VNVBCXY:1:1101:1380:2019 2:N:0:TAAGGCGA ATCTGCTCCTGGAGGTGGTAACAAGGTTCCACAGAAAAAAAGTAAAACTTGATGAAGATGATGAGGACGATGATGAGGACGATGAGGATGATGAGGATGA + AAGGGIIGGAGGGG<G.AGGIGIGAAGIIIGGGGGGGGAGIGIGGGIGGGGGGGGGGGGGIIGGGIIIIAGIGG.<G.GGGAGGGGGGGIAGGGAGGGGG

R1 is based on a 12bp of cell Barcode and a 8bp of UMI. The 12bp Barcode is "random" but constant for one cell. The 8bp UMI is random.

Would it be possible to use your tool to resolve the cell Barcodes as well as the UMI. The easiest way would maybe be to extract twice R1 with a paired option so that we append both the BC and the UMI in the name of the read in the R2. Then we could dedup the aligned file first on the BC and secondly on the UMI.

Please tell me if it seems doable and/or makes sense.

TomSmithCGAT commented 7 years ago

Hi @Hoohm. Arguably, you could use the a single barcode (cell + UMI) for deduplication as this would correct for errors in both the cell barcode and UMI simultaneously. However, unless I'm mistaken, the two barcodes are subtly different. Whilst both barcodes are random, the cell barcode should be one of n discrete sequence, where n is the number of cells sequenced. Due to sequencing errors (and possibly multiple cell barcodes in a single droplet for drop-seq?), the number of cell barcodes observed is > n. We would therefore like to correct the cell barcode against a whitelist of the real n barcodes. This is different to the UMIs where we don't have a whitelist of allowed sequences and hence we use networks to distinguish real UMIs from erroneous ones.

Is this the case with your drop-seq data too?

IanSudbery commented 7 years ago

I wasn't aware that drop-seq had white listed cell barcodes?

I would be templated to extract the full 20bp as a single barcode/UMI, perform the deduplication on this, and then split the barcode into cell barcode and umi. The only thing you might want to watch out for is that at 20bp, you might want up the edit_distance threshold. What do you think @TomSmithCGAT?

In theory, you could do the two separately, but what you'd need to do is this: 1) Deduplicate on gene + umi (method = unique) + cell barcode (method = directional) 2) Deduplicate the result of 1 on gene + umi (method= directional) + cell barcode(method = unique)

Currently we don't provide mechanisms to do this. (not saying its impossible, but its not in the current version, and would have to go on a to do list).

Hoohm commented 7 years ago

The way we consider valid cell barcodes is by checking the cumulative fraction of reads per cell barcodes on a so-called knee-plot. This allows to determine this white list of barcodes that you select as "whole cells". The "bend" is considered as the moment where the fraction of reads drops significantly and hence is not a whole cell anymore but rather floating RNA. (knee_plot attached)

We already have a method for doing this using a protocol given by the Macosko group, but I wanted to compare it with your tool. I've also tried some manual selection, with simple alignement, allowing one mismatch/deletion/insertion.

I could also use the whitelist from the other method and combine it with umi_tools for the UMI dedup.

2017-03-21 15:46 GMT+01:00 Tom Smith notifications@github.com:

Hi @Hoohm https://github.com/Hoohm. Arguably, you could use the a single barcode (cell + UMI) for deduplication as this would correct for errors in both the cell barcode and UMI simultaneously. However, unless I'm mistaken, the two barcodes are subtly different. Whilst both barcodes are random, the cell barcode should be one of n discrete sequence, where n is the number of cells sequenced. Due to sequencing errors (and possibly multiple cell barcodes in a single droplet for drop-seq?), the number of cell barcodes observed is > n. We would therefore like to correct the cell barcode against a whitelist of the real n barcodes. This is different to the UMIs where we don't have a whitelist of allowed sequences and hence we use networks to distinguish real UMIs from erroneous ones.

Is this the case with your drop-seq data too?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/97#issuecomment-288101242, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNXaD4YuHPoa5RmHRvnzRo5cS1EeYN5ks5rn-KsgaJpZM4MjpmK .

--

Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland

https://github.com/Hoohm https://github.com/Hoohm

TomSmithCGAT commented 7 years ago

Hi @Hoohm. The knee-plot approach sounds identical to the method used in the Klein et al inDrop paper which came out around the same time. When I re-analysed the inDrop data, I first made a cell barcode whitelist, then I parsed the raw data and split the reads into single end fastqs for each whitelisted cell barcode, discarding reads which did not match the whitelist. At the same time I corrected cell barcodes which were one mismatch away from a single whitelisted cell barcode, and extracted the UMIs and appended this to the read name. I think this approach would be the best way to go here too.

This is a relatively bespoke extraction procedure since it involves simulatenous cell barcode extraction and correction, raw data splitting and UMI extraction, hence it is not supported by UMI-tools extract. I'm happy to share my code for the above if you would like?

TomSmithCGAT commented 7 years ago

@IanSudbery Should we perhaps consider supporting a drop-seq/inDrop UMI extraction using the method above. We could leave it up to the user to generate a cell barcode white list as they wish. It wouldn't take much effort to add this to UMI-tools but my concern is how stable the drop-seq/inDrop methods are? However, presumably any droplet based method will have a similar issue with the cell barcodes?

IanSudbery commented 7 years ago

@TomSmithCGAT In the long run, I think we should provide the ability to pass dedup a tag that contains a cell barcode and this should be added to the bundle key so that splitting into separate files wasn't necessary.

I'm also interested in whether our methods could act as a replacement for the knee method. Either by deduping on the whole barcode (cell + UMI) or the two step porcess I outlined above.

TomSmithCGAT commented 7 years ago

@IanSudbery Good idea. How would you see this working generically with any aligner? We could fall back on the method of appending the cell barcode to the read id, i.e: @HISEQ:190:H5VNVBCXY:1:1101:1380:2019_AGACTAGGAG;AGCGAGCG where the barcodes are ;-separeted. If no cell barcodes are extracted, the cell barcode portion of the bundle key would be empty.

My guess is that the knee method should be near optimal but it would definitely be interesting to compare the results. One of the advantages of the knee method is that it guarantees the cell barcodes are the n most abundant even if this includes cell barcodes which are a single edit distance apart. This might not necessarily be the most accurate answer but it's perhaps more appealing than merging together two highly abundant cell barcodes which are 1 edit distance apart whilst retaining very lowly abundant cell barcode, as could occur with a network-based method. Merging together two highly abundant cell barcodes incorrectly could be lead to apparently interesting but ultimately erroneous findings from one's scRNA-Seq - I'm thinking 'dual-state cell' type findings arising from cell doublets.

An issue we would currently have with the network-based method is that the 'true' cell barcodes should be the same across the whole data set and we're currently treating each positions independently. We could solve this with a two pass method, e.g:

Pass 1: Tally counts for all cell barcodes across all reads, form networks and resolve to identify the n 'true' barcodes across sample, and crucially, the barcodes which are in the same cell group. Pass 2: Deduplicate per position, per cell group

For pass 1 I'd suggest the directional method, although the threshold may need to be stricter and/or the edit distance increased? Whichever method we use, we may need to also introduce a minimal counts threshold for a 'true' barcode to be accepted to avoid issues with very low abundance cell barcodes in a single node network.

Hoohm commented 7 years ago

I think you are right, merging two barcodes that are close but abundant should be avoided. Finding the top n cell barcodes for the whitelist should be used to collapse the rest of the reads but avoided inside the whitelist. Although, I'm pretty sure the chance of two high abundance read for two cell barcodes that are one bp away is pretty low. If it happens it might be the same, but this should be checked in post-processing. @TomSmithCGAT I would be really interested in the code you used.

2017-03-21 18:28 GMT+01:00 Tom Smith notifications@github.com:

@IanSudbery https://github.com/IanSudbery Good idea. How would you see this working generically with any aligner? We could fall back on the method of appending the cell barcode to the read id, i.e: @HISEQ:190:H5VNVBCXY:1:1101:1380:2019_AGACTAGGAG;AGCGAGCG where the barcodes are ;-separeted. If no cell barcodes are extracted, the cell barcode portion of the bundle key would be empty.

My guess is that the knee method should be near optimal but it would definitely be interesting to compare the results. One of the advantages of the knee method is that it guarantees the cell barcodes are the n most abundant even if this includes cell barcodes which are a single edit distance apart. This might not necessarily be the most accurate answer but it's perhaps more appealing than merging together two highly abundant cell barcodes which are 1 edit distance apart whilst retaining very lowly abundant cell barcode, as could occur with a network-based method. Merging together two highly abundant cell barcodes incorrectly could be lead to apparently interesting but ultimately erroneous findings from one's scRNA-Seq - I'm thinking 'dual-state cell' type findings arising from cell doublets.

An issue we would currently have with the network-based method is that the 'true' cell barcodes should be the same across the whole data set and we're currently treating each positions independently. We could solve this with a two pass method, e.g:

Pass 1: Tally counts for all cell barcodes across all reads, form networks and resolve to identify the n 'true' barcodes across sample, and crucially, the barcodes which are in the same cell group. Pass 2: Deduplicate per position, per cell group

For pass 1 I'd suggest the directional method, although the threshold may need to be stricter and/or the edit distance increased? Whichever method we use, we may need to also introduce a minimal counts threshold for a 'true' barcode to be accepted to avoid issues with very low abundance cell barcodes in a single node network.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/97#issuecomment-288155648, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNXaKVuIaSUOhwRQXEo8W0l815AhYLpks5roAjBgaJpZM4MjpmK .

--

Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland

https://github.com/Hoohm https://github.com/Hoohm

IanSudbery commented 7 years ago

I feel like the first step would be to allow dedup to do the per position, per cell group deduplication based on a user specified BAM tag. This step would be useful for pretty much any multiplexed single cell sequencing technique I reckon.

The next step would be to implement methods for getting that tag onto the BAM record. I think the would have to be some sort of extra field on the fastq record. Could be: @HISEQ:190:H5VNVBCXY:1:1101:1380:2019_<CELL_BC>_<UMI>

Thus the reads would still be compatible the current format where split("_")[-1] is taken to be the UMI.

Finally some method for collapsing cell barcodes. Is the "knee" for the "knee" automatically calculated, or done by hand? I suspect that this would be a separate tool. The "knee" method sounds remarkably similar to the "percentile" method for umi deduplicating (except applied across the whole sample rather than just at one position).

TomSmithCGAT commented 7 years ago

@IanSudbery There's an example of the 'knee' plot on page 12 of the drop-seq guide here. It appears that the 'knee' is not automatically calculated. I think the easiest way to do it automatically would be to make a density plot for counts per cell barcode and then find where the second derivative = 0. This would represent the point between the distribution of counts for the real cell barcodes and the errors.

TomSmithCGAT commented 7 years ago

@Hoohm. I'll dig out the relevant code tomorrow afternoon

Hoohm commented 7 years ago

@TomSmithCGAT That would be the easiest way to do it. Although, some people are struggling to have a clear signal and don't get this "bend" on the plot, making it complicated sometimes to define where the limit is. Automatic is fine, but there should be a possibility to decide manually.

TomSmithCGAT commented 7 years ago

@Hoohm I've attached a python function which will take the paired end fastqs from drop-seq and perform the UMI extraction. Github wont let me attach a .py file so I saved it as .txt: parse_dropseq.txt

This code is modified from the extractUMIsAndFilterFastqGSE65525 function in PipelineScRNASeq.py which I used to re-analyse the inDrop data for the UMI-tools publication.

b-dawes commented 7 years ago

Hello,

I just wanted to chime in that I'm also interested in applying UMI Tools to dropseq data. Have you made any progress on this front?

I'm using Dropseq Tools from the McCarroll lab (under "Computational resources" here) to process my data. The output of this will be a BAM with the cell barcode in the XC tag, UMI in the XM tag, and gene in the GE tag.

If I wanted to fix UMI errors within cell barcodes, I would first need to split my BAM into one BAM per cell barcode and then run a command like this for each BAM: umi_tools dedup -I in.bam --extract-umi-method tag --umi-tag XM --gene-tag=GE -S out.bam

Is this the correct way of doing this? Is there some way to prevent UMIs from different cells being compared against each other so that I do not have to split the BAM file?

Additionally, it would be very nice if I was able to use UMI Tools to correct cell barcode errors. For context, we generally use tens of thousands of beads but get millions of unique cell barcodes. The majority of these cell barcodes have only one or two reads in them but there's a significant population with many reads, presumably from PCR errors. Recovering these reads would be very helpful in further analysis. I know @TomSmithCGAT mentioned that he did this with indrop data, would you mind detailing how exactly you did this?

Hoohm commented 7 years ago

I haven't put more work into it right now I was busy pushing a new version of my dropseq pipeline https://github.com/Hoohm/dropSeqPipe.

I will add the UMI tools in there once I have tested it against the correction/fusion method proposed by the McCarroll lab. I think the dedup method is a good one, although there is one thing that might get you to loose some cell barcodes in the process. If I'm not mistaken, the DetectBeadSynthesisErrors is trying to correct for errors in the cell barcodes, so splitting your data before this step will make you loose some reads.

I would hope we can make use of UMI-tools to check for cell barcodes also, but I'm not sure it accounts for production biases as mentioned in the cookbook:

In June 2015, we noticed that a recently purchased batch of ChemGenes beads generated apopulation of cell barcodes (about 10­20%) with sequences that shared the first 11 bases, but differedat the last base. These same cell barcodes also had a very high percentage of the base “T” at the lastposition of the UMI. Based on these observations, we concluded that a percentage of beads in the lothad not undergone all twelve split­and­pool bases (perhaps they had stuck to some piece ofequipment or container, and the been re­introduced after the missing synthesis cycle). Thus, the20­bp Read 1 contained a mixed base at base 12 (in actuality, the first base of the UMI) and a fixedT­base at base 20 (in actuality, the first base of the polyT segment). I'll keep you posted when I tested it.

2017-05-03 21:38 GMT+02:00 b-dawes notifications@github.com:

Hello,

I just wanted to chime in that I'm also interested in applying UMI Tools to dropseq data. Have you made any progress on this front?

I'm using Dropseq Tools from the McCarroll lab (under "Computational resources" here http://mccarrolllab.com/dropseq/) to process my data. The output of this will be a BAM with the cell barcode in the XC tag, UMI in the XM tag, and gene in the GE tag.

If I wanted to fix UMI errors within cell barcodes, I would first need to split my BAM into one BAM per cell barcode and then run a command like this for each BAM: umi_tools dedup -I in.bam --extract-umi-method tag --umi-tag XM --gene-tag=GE -S out.bam

Is this the correct way of doing this? Is there some way to prevent UMIs from different cells being compared against each other so that I do not have to split the BAM file?

Additionally, it would be very nice if I was able to use UMI Tools to correct cell barcode errors. For context, we generally use tens of thousands of beads but get millions of unique cell barcodes. The majority of these cell barcodes have only one or two reads in them but there's a significant population with many reads, presumably from PCR errors. Recovering these reads would be very helpful in further analysis. I know @TomSmithCGAT https://github.com/TomSmithCGAT mentioned that he did this with indrop data, would you mind detailing how exactly you did this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/97#issuecomment-299013514, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNXaOxVBpQrqL_so_RdgUZQmMPslnQsks5r2NfGgaJpZM4MjpmK .

--

Roelli Patrick Division of Animal Physiology and Immunology TUM School of Life Sciences Weihenstephan Technische Universität München Weihenstephaner Berg 3 85354 Freising GermanyBCF, Swiss Institute of BioinformaticsBâtiment Génopode, Quartier SorgeUniversity of Lausanne1015 LausanneSwitzerland

https://github.com/Hoohm https://github.com/Hoohm

IanSudbery commented 7 years ago

As far as I can tell, currently the McCarroll pipeline maps to the genome, while at the moment we only support per_gene deduplicating if reads are mapped to the transcriptome. We are working on ways around this that don't involve a large increase in memory usage (the problem is we never know when we've seen all the reads associated with a gene and so don't know when to do the UMI collapsing. This means we have to remember the entire file and do all the dedupling at the end, which is very memory intensive).

At the moment it is also necessary to split the file into one file per cell. We are working on ways around this as this shouldn't be too hard to fix.

Look out for all these things in the next major feature (i.e. non-bugfix) release.

TomSmithCGAT commented 7 years ago

Hi @b-dawes. I just like to echo @IanSudbery's comments. We're actively working on implementing an approach whereby you can retain the reads from all cells in a single BAM and use UMI-tools to both error correct the cell barcode and deduplicate the reads in an error-aware manner.

In answer to your question about indrop data, as you'll see from my comment above yours, I had to do this in a more laborious, naive manner which involved knowing the number of cell barcodes beforehand, splitting the reads into one file per cell barcode and then deduplicating.

We'll update this issue when we have a better solution

Hoohm commented 6 years ago

Hello,

I'm looking for potential updates on UMI-tools to work with drop-seq/10x/etc protocols. I would be keen on trying any new implementation you are still testing out.

IanSudbery commented 6 years ago

Hi @Hoohm, you can checkout the {ts}_support_cell_barcodes branch, which we give no guarantees on at the moment, but is pretty close to ready and should work as far as we know.

We are also working on a tutorial for analysing 10x/drop-seq/inDrop using umi_tools.

Hoohm commented 6 years ago

@IanSudbery Cool, I'll look into it then!

Hoohm commented 6 years ago

Hey, I'm already coming back with one question. If I run a partial run (first 400 000) lines, everything works fine. If I run the whole file (~100'000'000 reads) I get this error:

TypeError: 'NoneType' object is not iterable

umi_methods.py", line 330, in getErrorCorrectMapping
    whitelist = set([str(x).encode("utf-8") for x in whitelist]

So, I guess the whitelist is empty for some reason. I'll test it out a bit more, but if you have an idea, I'm listening.

Hoohm commented 6 years ago

Ok, same error with 40 000 000 but it works with 20 000 000. The local_min is not found for some reason and it returns None, hence it crashes. Here are a few values I printed out while testing:

In the meantime I'll use the cell threshold given by a partial run and run it again with it as cell_count.

IanSudbery commented 6 years ago

Wow. That is a lot of reads!!!

I'm guessing its not finding a local minima. @TomSmithCGAT Might now more about this than me, but on those error runs, did it get as far as outputting the distributional plot? (You'll need to have run it with --plot-prefix).

What sort of order of magnitude of cells were you expecting?

I suspect we need to catch this and tell users that they need to examine the plots and pick a threshold manually.

TomSmithCGAT commented 6 years ago

Hi @Hoohm - Thanks for reporting the error.

The selection of the local minima is currently a bit dirty - since there are often local minima in the region of very high read/UMI counts per cell, there are some cut-offs to make sure these minima are rejected. What's happened here is that the local minima that were detected (194 & 898) were both rejected.

I'm finding that these manual cut-offs are not suitable for across all data sets (and in your case, different numbers of input reads) so we'll will probably need to try and alternative approach. In the mean time, I'll make a patch to the branch so that the counts for the local mins are still output even when the whitelist generation fails. Following @IanSudbery's suggestion, you can then run whitelist with --plot-prefix to obtain the counts and then re-run whitelist with the --set-cell-number option to manually set the number of accepted cell barcodes.

Alternatively, you can use the whitelist generated using the first 20M reads as is plenty to identify the "true" cell barcodes (you can use the --subset-reads option if you aren't doing so already). I'd definitely recommend visually assessing the plots to check you are happy that the number of cell barcodes detected is reasonable though?

Apologies for the inelegant solutions.

TomSmithCGAT commented 6 years ago

Just to clarify, the local_min values relate to the x-axis on the density graph, not the # cell barcodes or # reads/UMIs per barcode - the values are meaningless by themselves without reference to the distribution of counts per cell barcode. threshold is the minimum number of reads/UMIs per cell barcode for it to be included in the density plot.

TomSmithCGAT commented 6 years ago

@Hoohm - If you want to run whitelist and inspect the plots this is now possible even where the selection of the local minima fails (143f1ac0c6312863421c8e4eab6c06a715a36619). The cell-barcodes branch has now also been merged into the master branch so al the single cell functionalities are available on the master branch.

Hoohm commented 6 years ago

Perfect, I have a new run to test it on :) @TomSmithCGAT I think the idea of running the sample once with a subset of reads to create an estimated number is great. I'll do that and might add a 5% on top just to be sure. I'll filter them out later on downstream analysis. @IanSudbery If I remember correctly, the plot didn't work either on the whole run. We expected from 4 to 5k cells.

IanSudbery commented 6 years ago

Just in case you don't already know, you can use the --subset-reads option to use only a sub-sample of reads to generate the whitelist.

TomSmithCGAT commented 6 years ago

@Hoohm - Until 143f1ac the plotting would have failed whenever the whitelist generation failed. That shouldn't be the case any more

TomSmithCGAT commented 6 years ago

Hi @Hoohm. Just to let you know that v0.5 is now out, including a guide on using UMI-tools for scRNA-Seq: https://github.com/CGATOxford/UMI-tools/blob/master/doc/Single_cell_tutorial.md. I'll close this issue now but feel free to re-open if you have any more questions for your analysis