Hi Fede,
Would it be possible to see the knee plot in question? 30,000 BCs passing the threshold sounds like too many to me, particularly if there are only 1000 reads associated with each cell.
> Now I was wondering why I also find pairs of accepted BC that have a distance of 1 to each other, shouldn’t those be pooled together
They shouldn't be pooled together, but one accepted BC shouldn't appear as an error possibility of another. The process goes like: BCs are ranked by their counts and everything above the knee threshold is accepted; each BC below the threshold is then compared against the accepted list, and only if it is within the error-correct distance of a single accepted BC is it recorded as a likely error of that BC. Two BCs that are both above the threshold should therefore never appear as errors of each other.
Perhaps @TomSmithCGAT can shed some light?
In terms of the error correction threshold - with a 12 base barcode, each increase of the edit-distance threshold will multiply the time needed for constructing the table by roughly 12*3, so I can well believe the jump from 10 minutes to 5 hrs. I think that you are probably fine with an edit distance of 1, and although there may be some true cell barcodes at edit distance 2, it gets increasingly difficult to disentangle them and the number should be small.
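For intuition, a minimal sketch of how the substitution-only neighbourhood of a 12-base barcode grows with the edit-distance threshold (the neighbourhood sizes are exact; the runtime scaling above is an approximation):

```python
from math import comb

# Barcodes within edit distance d of a 12-mer, substitutions only:
# sum over i of C(12, i) * 3^i possible mismatch combinations.
def n_neighbours(length: int, max_dist: int) -> int:
    return sum(comb(length, i) * 3**i for i in range(1, max_dist + 1))

for d in (1, 2, 3):
    print(d, n_neighbours(12, d))  # 1 -> 36, 2 -> 630, 3 -> 6570
```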
Hi @fede86. I strongly suspect that the knee threshold is incorrect and would also be interested to see the plots. Unfortunately, the "knee" method in UMI-tools is not 100% reliable and updating this has long been on the "to-do" list. As a fallback, one can inspect the knee plot and then re-run and specify the number of cells using the --set-cell-number option.
As @IanSudbery points out, the identification of BCs with an error only occurs for BCs below the knee threshold, so I expect that many of the accepted BCs within one edit distance of another accepted BC will be low-frequency BCs. I agree with @IanSudbery that an edit distance of 1 with a 12-base barcode seems perfectly reasonable and there will be very few reads with >1 error in the barcode. Since you have already run allowing 1 or 2 errors, you can always parse the whitelist outputs to see how many more reads were retained allowing 2 errors if you are interested in this.
Hi @IanSudbery and @TomSmithCGAT and thank you very much for your super quick reply. I attached the generated plots. I should add that, as suggested, I ran the analysis twice: initially four possible cutoffs are identified/rejected, but I then chose 27935 based on the knee plot and re-ran it using --set-cell-number.
count	action
32	Rejected
76	Rejected
27053	Rejected
27935	Rejected
I then read in the BCs of the whitelist to filter the ones with low linguistic complexity (such as AAAAAAAAAAAA) and there I realized that, e.g., if I just look at the first 1000 (of 27935) accepted barcodes in the whitelist, I find 244 pairs of barcodes with an edit distance of just 1. The first occurrence as an example (and they show rather high frequencies of >1000, so they should not be below the defined cutoff):
AAAAAGGCGGAA AAAAAAGCGGAA,AAAAACGCGGAA,AAAAAGACGGAA,AAAAAGGAGGAA,AAAAAGGCAGAA,AAAAAGGCCGAA,AAAAAGGCGAAA,AAAAAGGCGCAA,AAAAAGGCGGCA,AAAAAGGCGGGA,AAAAAGGCGTAA,AAAAAGGCTGAA,AAAAAGGTGGAA,AAAAAGNCGGAA,AAAAAGTCGGAA,AAAAATGCGGAA,AAAACGGCGGAA,AAAAGGGCGGAA,AAACAGGCGGAA,AAAGAGGCGGAA,AAANAGGCGGAA,AACAAGGCGGAA,AGAAAGGCGGAA,ANAAAGGCGGAA,GAAAAGGCGGAA,NAAAAGGCGGAA,TAAAAGGCGGAA 1684 10,1,3,1,6,1,8,1,2,3,1,3,6,1,1,1,54,2,1,1,1,2,1,3,4,7,1
AAAAAGGCGGAC AAAAAAGCGGAC,AAAAAGACGGAC,AAAAAGGCAGAC,AAAAAGGCGAAC,AAAAAGGCGGCC,AAAAAGGCGTAC,AAAAAGGTGGAC,AAAACGGCGGAC,ACAAAGGCGGAC,ANAAAGGCGGAC,GAAAAGGCGGAC,NAAAAGGCGGAC,TAAAAGGCGGAC 1005 3,2,2,2,2,2,1,27,4,4,1,5,1
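For reference, a minimal sketch of this pairwise check, with toy barcodes standing in for the whitelist column (brute force over all pairs, which is fine for ~1000 barcodes):

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    # fixed-length barcodes, so substitution-only edit distance
    return sum(x != y for x, y in zip(a, b))

# first column of the whitelist; toy values shown here
accepted = ["AAAAAGGCGGAA", "AAAAAGGCGGAC", "AAAACGGCGGAA"]

pairs = [(a, b) for a, b in combinations(accepted, 2) if hamming(a, b) == 1]
print(len(pairs), pairs)
```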
I also assumed that an edit distance of 1 seems reasonable, but I am still trying to work out whether we have a "problem" in the experiment or in the analysis, and obviously we hope for the second option :). The number of reads per cell should be much higher.
Just for my understanding: if I lower the threshold, e.g. to 5000 cells, is the result not just a subset of the 27935 list? Or would it probably find a lot more non-accepted BCs, assign them to the smaller number of accepted ones and thereby increase the number of reads per cell? And if the second option is true, how do I identify a "proper" threshold if not based on the knee?
Thank you very much again!
Hi @fede86. Thanks for the additional information. From the knee plots, it looks like a threshold of ~25,000 cells is correct, so 27,935 doesn't look too far off at all. Of course, a more relaxed threshold will ensure you capture all cells and you can then perform additional filtering downstream to remove any potential empty droplets. Personally, I would treat the knee threshold as the initial filter for likely empty cells and use downstream analyses to identify possible "low-quality" CBs which were just above the threshold. This is the approach implemented in alevin (which we played a small part in developing): https://www.biorxiv.org/content/early/2018/10/24/335000.
umi_tools whitelist will only process the first 100M reads by default. You can change this using the --subset-reads option. At the bottom of the logfile, it will output the number of parsed reads and the number which match the regex barcode. Assuming you processed the default 100M reads, then a mean of 1000 reads per cell for ~30,000 cells = ~30M of the 100M reads matching the selected CBs, which doesn't seem too bad, considering these are only the reads with exact matches to the selected CBs. You can identify the precise number of reads from the selected CBs using awk '{sum+=$3} END {print sum}' [umi_tools_whitelist], but this won't capture the number of reads which matched with an error. If you want that number you would have to parse the additional column in the output, which gives the counts for each error barcode. In hindsight, it would be useful to output this information in the log too, so we'll add that for the next release.
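A rough sketch of that parsing, assuming the four whitelist columns shown above (accepted CB, error CBs, CB count, comma-separated error counts) are tab-separated, and with "whitelist.txt" as a placeholder path:

```python
exact = corrected = 0
with open("whitelist.txt") as fh:  # placeholder path
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        exact += int(fields[2])  # reads matching this CB exactly
        if len(fields) > 3 and fields[3]:
            # comma-separated counts, one per error barcode
            corrected += sum(int(c) for c in fields[3].split(","))

print(f"exact: {exact}  corrected: {corrected}  total: {exact + corrected}")
```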
The low edit distance between the selected 27935 CBs does seem a bit concerning. I guess one approach would be to confirm that the expression estimates for pairs of CBs within one edit distance are no more correlated than random pairs of CBs. That would give you confidence that they really are unique cells. Of course, there would still be an issue with a small proportion of reads being misassigned due to a single base-call error, so you might want to estimate the frequency at which that could occur.
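A rough sketch of that correlation check on a toy expression matrix; `expr` and `ed1_pairs` are placeholders for your own expression estimates and edit-distance-1 CB pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy genes x cells matrix; replace with your expression estimates
n_genes, n_cells = 500, 100
expr = rng.poisson(1.0, size=(n_genes, n_cells)).astype(float)

# placeholder column indices for CB pairs at edit distance 1
ed1_pairs = [(0, 1), (2, 3), (4, 5)]

def corr(i: int, j: int) -> float:
    return np.corrcoef(expr[:, i], expr[:, j])[0, 1]

ed1_r = [corr(i, j) for i, j in ed1_pairs]
rand_r = [corr(*rng.choice(n_cells, size=2, replace=False)) for _ in ed1_pairs]

# duplicated cells should make ed1_r clearly higher than rand_r;
# similar medians suggest the ED-1 pairs really are distinct cells
print(np.median(ed1_r), np.median(rand_r))
```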
The threshold here appears to be at around 100 reads per cell, no?
I would be inclined to use a more stringent threshold. You are correct that if you use a higher threshold, then some of the cells currently selected as accepted will be merged into other cells. This may or may not be a good thing - you don't want to create artificial doublets.
Tom is right that alevin has a much more intelligent system for whitelisting, and includes our UMI deduplication algorithms. And I think it works with DropSeq?
The threshold is ~100 reads per cell, but the mode is ~1000, which may well be correct if there really are ~30,000 filled droplets and not all reads in the input have been parsed. Looking at the first knee plot above (rank vs. cumulative count), this threshold looks to be correct as the starting point.
@k3yavi - Would you mind adding here the command to use alevin with DropSeq data, should @fede86 be interested?
Hi @TomSmithCGAT and @IanSudbery and thank you again for your help!
Thank you for bringing --subset-reads up; we actually have slightly more than 100 million reads, so it's worth changing it. 1000 was not the mean number of transcripts for each BC; I just meant that only a small proportion (<10%) of those 30,000 BCs actually have 1000 or more transcripts associated.
I spent a couple of hours investigating this problem and realized that these similar barcodes with an edit distance of 1 usually (80%) differ only in the last position of the BC. Also, from clustering our data I could see that these pairs of BCs appear close together and probably originate from the same cell. Finally, I found information about this (obviously known) problem in the DropSeq Cookbook (http://mccarrolllab.org/wp-content/uploads/2016/03/Drop-seqAlignmentCookbookv1.2Jan2016.pdf):
> In June 2015, we noticed that a recently purchased batch of ChemGenes beads generated a population of cell barcodes (about 10-20%) with sequences that shared the first 11 bases, but differed at the last base. These same cell barcodes also had a very high percentage of the base “T” at the last position of the UMI. Based on these observations, we concluded that a percentage of beads in the lot had not undergone all twelve split-and-pool bases (perhaps they had stuck to some piece of equipment or container, and then been reintroduced after the missing synthesis cycle). Thus, the 20bp Read 1 contained a mixed base at base 12 (in actuality, the first base of the UMI) and a fixed T base at base 20 (in actuality, the first base of the polyT segment). To correct for this, we generated DetectBeadSynthesisErrors, which identifies cell barcodes with aberrant “fixed” UMI bases. If only the last UMI base is fixed as a T, the cell barcode is corrected (the last base is trimmed off) and all cell barcodes with identical sequence at the first 11 bases are merged together. If any other UMI base is fixed, the reads with that cell barcode are discarded.
They also provide the script inserting an "N" at the 12th position of the BC for those cases; however, the input is a BAM file. I would like to solve this issue while creating the whitelist on the raw sequence file. I don't know if it's worth implementing a fix for this known (and obviously still occurring with much newer beads) error into your tools. However, to my understanding it should also work to change the lengths of the BC (length 12) and UMI (length 8) to 11 and 9, respectively. Only very few cases of BCs with REAL differences in the last position should be negatively affected by this change. It's maybe not optimal, but I will try whether it solves the main issue.
Anyhow, I would be interested to also try alevin for whitelisting :)
From some relatively rudimentary experiments, I've found that 100M reads is more than enough to identify the correct knee threshold but you can obviously increase this if you desire.
Thanks for the details of where this problem originates. I was completely unaware of this but it perfectly explains the mismatch between the expected and observed number of cells and the low edit distance. Within UMI-tools I think we can get around this easily by changing the way the barcodes are extracted so that this 12th position is simply discarded from the read entirely and we just go forward with an 11 base barcode.
I've forgotten the barcode pattern for DropSeq but if you post your command I can show you how to update it to discard the 12th position.
I am not fully sure what the best way of fixing this problem, which occurs in ~15% of the BCs, is. Surely, the 12th position has to be ignored for those BCs, but this position actually already belongs to the UMI. Therefore, I think the best way is to adjust the BC/UMI pattern to define the last BC position as UMI instead of BC. Ignoring it would also be an option. The only potential problem I see here is for the 85% of "correct beads having a length 12 BC" with a real difference only in the last position. Pairs of those BCs would then be pooled; however, I guess these should be negligible single occurrences. Currently my command (with adjusted bc-pattern) looks somewhat like this:
umi_tools whitelist --stdin $i/FC*_1.fq.gz --bc-pattern=CCCCCCCCCCCNNNNNNNNN --extract-method=string --method=umis --log2stderr --log=$i/umi_tools/log --error=$i/umi_tools/error --plot-prefix=$i/umi_tools/ --set-cell-number=$line --subset-reads 200000000 > $i/umi_tools/whitelist.txt;done
Ah, yeah, I misread and thought their solution was simply to replace the 12th position with an N. Upon re-reading I understand it is not that simple. I'll give it some more thought to see if we can come up with a solution within umi_tools.
Could we not do:
(?P<cell_1>.{11})(?P<discard_1>.)(?P<umi_1>.{8})(?P<discard_2>.)
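A quick check of that pattern with Python's re; note this assumes read 1 has at least 21 bases so that both discard groups have a base to consume:

```python
import re

pattern = re.compile(
    r"(?P<cell_1>.{11})(?P<discard_1>.)(?P<umi_1>.{8})(?P<discard_2>.)"
)

# hypothetical 21bp read 1 from an unaffected bead:
# 12-base CB + 8-base UMI + first poly-T base
read = "AAAAAGGCGGA" + "C" + "ACGTACGT" + "T"

m = pattern.match(read)
print(m.group("cell_1"), m.group("umi_1"))  # cell/umi groups kept, discard_* dropped
```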
As I understand it, the problem can be summed up as:
- affected beads (~10-20%) carry only an 11-base CB, so base 12 of read 1 is actually the first UMI base and base 20 is the first base of the poly-T;
- unaffected beads carry the full 12-base CB, with the UMI at bases 13-20;
- read 1 alone doesn't tell you which case a given read belongs to, so any fixed extraction pattern corrupts one population or the other.
I think the only stage to handle this properly is to consider the CB and UMI sequences of all reads with the same alignment coordinates, hence the script mentioned above takes the BAM as input. In theory it would be perfectly possible for umi_tools to support CB whitelisting/filtering on the BAM file (e.g. extract UMIs and all CBs -> map -> whitelist CBs (with optional UMI/CB correction to deal with the above) -> dedup/count). This is not a quick fix though.
@fede86, I'm afraid there are no plans to deal with this within UMI-tools, so as it stands, the DetectBeadSynthesisErrors script you mention from the DropSeq authors seems like the best approach.
Alternatively, if you wanted to add this functionality to UMI-tools we could help advise how best to do so.
I'm closing this issue due to inactivity.
First of all, thank you for your great tool, which I am using to analyse DropSeq data (barcode length of 12, 100 million reads). I followed basically the standard protocol; however, further downstream in my analysis (Seurat R package) I realized that something looks odd about my data, as I basically see a lot of "noise", and I suspect that it is maybe related to the correct barcode (BC) extraction at the beginning of the analysis with umi-tools.

The problem is that we don't really know how many cells we have in our experiment; we expected maybe around 5000 (not sure), but when I try to estimate the number of barcodes manually to find the best cell threshold, the number is actually around 30,000 (!), and this also fits clearly with the knee in the plot. So I am not sure why this number is so huge, and I guess maybe some BCs are separated in the analysis but actually belong to the same cell. Also, most of those 30,000 BCs have fewer than 1000 associated reads and therefore can't really be used later. My first question is therefore how reliable this estimate is and whether it already takes mismatches in the BC into account. An alternative explanation is maybe that we simply have a lot of noise in our experiment because a lot of beads that didn't pair with a cell get sequenced, or that due to a high read depth we get more erroneous BC reads.
So if I then use the standard parameters (default edit distance of BC = 1) to create the whitelist, I get the expected table with 30,000 accepted BCs and, in the next column, other BCs of distance 1 to each accepted BC. Now I was wondering why I also find pairs of accepted BCs that have a distance of 1 to each other; shouldn't those actually be pooled together, or is this somehow dependent on the number of occurrences of that specific BC, since it's not expected to see the same erroneous BC hundreds of times, as an error in the BC should be rather "random"? This also raises the question of what happens to BCs that are maybe of distance 1 to more than one accepted BC, and how this changes when I increase the edit distance.
Also, when I tried to increase the edit distance to 2 in the whitelist step via "--error-correct-threshold=2", the script already needs 5 h instead of 10 min (for distance 1), and with a distance of 3 it seems stuck and never actually finishes creating the whitelist even if I let it run for a few days; but probably a distance of 3 is way too high for a BC length of 12 anyway.
Thank you in advance and best wishes, Fede