CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

question: umi_tools whitelist to recover barcodes in an amplicon sequencing experiment #596

Closed MGordon09 closed 9 months ago

MGordon09 commented 1 year ago

Hi guys,

Thanks for developing this great suite of tools. I have a question regarding applying UMI_tools extract to identify cell barcodes from what is essentially an amplicon sequencing experiment.

The goal is to develop an analysis pipeline to process data produced from an amplicon deep mutational scanning experiment (protocol here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9420302/). Essentially, the group generated a library of target cDNA mutants, where each library molecule contains a specific codon substitution and a unique 32bp barcode sequence for identification. Each sample contains ~30M reads (~250bp PE) and ~10^5 unique molecules.

I want to identify all 'true' cell barcodes accounting for sequencing errors and demultiplex by barcode for downstream variant calling and processing. I was hoping I could use umi_tools whitelist for this.

The barcodes are in-line and read structure (after trimming & merging) is the following:

XXXXXXX__NNN..___XXX
_  known flanking regions
N.. random barcode (n=32)
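
For readers who want to sanity-check this read structure outside UMI-tools, here is a minimal sketch using only Python's standard library. The two flanking sequences are copied from the --bc-pattern used later in this thread; the function names are hypothetical, and the match is exact, so it does not reproduce the fuzzy {s<=2} matching offered by the third-party regex module.

```python
import re
from collections import Counter

# Flanking sequences taken from the --bc-pattern quoted in this thread.
LEFT_FLANK = "TGGATCCGGTACCGAGGAGATCTG"
RIGHT_FLANK = "GCGATCGC"
BARCODE_LEN = 32

# Exact-match pattern: left flank, 32-base barcode, right flank.
# Assumption: no mismatches are tolerated in the flanks (stdlib `re` has no fuzzy matching).
PATTERN = re.compile(
    re.escape(LEFT_FLANK)
    + "([ACGTN]{%d})" % BARCODE_LEN
    + re.escape(RIGHT_FLANK)
)


def extract_barcode(read):
    """Return the 32-base barcode between the flanks, or None if the read does not match."""
    match = PATTERN.search(read)
    return match.group(1) if match else None


def count_barcodes(reads):
    """Count how often each barcode is observed across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        barcode = extract_barcode(read)
        if barcode is not None:
            counts[barcode] += 1
    return counts
```

Reads whose flanks contain a sequencing error are simply skipped here, which is one reason a fuzzy pattern is attractive for real data.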

I ran the following command to generate a whitelist of the barcodes:

umi_tools whitelist --stdin SRR20707784_merge_trim.fastq.gz --method 'reads' --knee-method "density" --error-correct-threshold 2 --plot-prefix 'test.out.umitools' --ed-above-threshold 'discard' --extract-method=regex --bc-pattern='(?P<discard_1>.+TGGATCCGGTACCGAGGAGATCTG){s<=2}(?P<cell_1>.{32}){s<=2}(?P<discard_2>GCGATCGC.+$)'

but hit the following error:

ValueError: barcode regex(es) do not include any umi groups (starting with 'umi_') regex.Regex('(?P<discard_1>.+TGGATCCGGTACCGAGGAGATCTG){s<=2}(?P<cell_1>.{32}){s<=2}(?P<discard_2>GCGATCGC.+$)', flags=regex.V0), None

I ran into a similar error changing cell_1 to umi_1, so I guess both are required?

My questions are i) is it possible to run the program w/o specifying both a cell and umi sequence, as I only have one? ii) is this approach valid, or should I consider another tool for the purpose? Any other suggestions would be greatly appreciated!

Thanks for your time, Martin

TomSmithCGAT commented 1 year ago

Hi Martin,

Sorry for the very slow reply.

From what I understand, the authors refer to the 32bp barcode as a UMI, but since each barcode is specific to a cell, one wants to treat it like a cell barcode.

The whitelist capabilities of UMI-tools were really added as a bolt-on to help support scRNA-Seq, and there are likely to be better ways to perform whitelisting. I've recommended Sircel in the past, but I haven't kept up with the latest developments.

In your case, since you want to demultiplex using the cell barcodes, I don't think that UMI-tools will be the appropriate tool, as it's not designed for demultiplexing.

MGordon09 commented 1 year ago

Hi Tom,

No problem! Thanks for responding.

Yes, I also thought the 32bp sequence sounded like an sc cell barcode, which was why I wanted to use umi_tools whitelist to extract barcode sequences, and then use something like sabre or a custom script to demultiplex (apologies for not being clear). My question really was whether whitelist would be suitable for recovering barcodes and correctly grouping barcodes with sequencing errors.
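
To make "grouping barcodes with sequencing errors" concrete, here is a minimal sketch of the general idea. It is not the algorithm UMI-tools implements: it assumes the number of true barcodes (n_true) is already known instead of estimating it with a knee method, it uses Hamming distance only, and build_whitelist is a hypothetical name.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def build_whitelist(counts, n_true, max_dist=2):
    """Map each accepted barcode to the error barcodes assigned to it.

    Assumption: the n_true most frequent barcodes are the true ones.
    A remaining barcode is assigned only if exactly one true barcode lies
    within max_dist mismatches; ambiguous or distant barcodes are dropped.
    """
    ranked = [barcode for barcode, _ in counts.most_common()]
    true_barcodes = ranked[:n_true]
    whitelist = {barcode: [] for barcode in true_barcodes}
    for barcode in ranked[n_true:]:
        near = [t for t in true_barcodes if hamming(barcode, t) <= max_dist]
        if len(near) == 1:
            whitelist[near[0]].append(barcode)
    return whitelist
```

The all-against-all comparison is quadratic, so with ~10^5 true barcodes it would need an index or neighbourhood lookup to be practical; the sketch only illustrates the assignment rule.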

Thank you for the recommendation, I had not heard of Sircel and will take a look.

TomSmithCGAT commented 1 year ago

Ah, so you were going to use whitelist to identify the cell barcodes and the associated error barcodes but then demultiplex with another tool. In that case, whitelist could theoretically work, but as you say it currently errors if you try and provide a regex pattern which doesn't include both a umi and a cell barcode.

This is an issue that's been raised before (e.g. #332) and we still have an open PR which resolves this by raising a warning instead of an error (#447).

The sticking point is a desire not to over-expand UMI-tools into applications beyond UMIs. Arguably, we (I) did this when adding the whitelist tool, since it actually only operates on the cell barcodes. At the time, this was to better support scRNA-seq when there weren't many options available for handling the cell barcodes. However, the whitelisting approach has never been subjected to the same level of rigorous assessment that the UMI deduplication was.

Adding support to allow whitelisting for data without UMIs strikes us as a step too far for a tool designed to handle reads with UMIs. However, to make matters more confusing, it is possible to use whitelist without UMIs, but only when using --extract-method string, which is a much less flexible way to define the barcode pattern 🤦‍♂️.

@IanSudbery - I'm going to add this to the FAQs so we can direct future questions in that direction.

dawe commented 1 year ago

Hello, I'd like to chime in on this, as I had the same issue as @MGordon09 but in another application (scATAC-seq). As pointed out in the PR I opened, the change would be rather easy: just turning an error into a warning. I've been using UMI-tools for quite a long time to whitelist scATAC-seq data and I have to say it works just fine (after all, it already operates only on cell barcodes, as you pointed out). My current workflow is to use my fork of this repository, where I constantly update umi_tools/Utilities.py, commenting out the lines responsible for the UMI check (1171->1179). I don't use UMI-tools to perform deduplication, which is instead handled by specific applications for scATAC-seq (chromap, in my case, but also custom code where needed). I understand @TomSmithCGAT's worries about opening the possibility to handle non-UMI based applications; however, I believe it's quite common that software (especially in compbio) is used in alternative ways beyond the original scope, hence I stand for the extension of capabilities, at least for whitelist generation.
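
The "error into a warning" idea can be illustrated with a small standalone sketch. This is not the code in umi_tools/Utilities.py nor the change proposed in the PR; check_barcode_groups and its strict flag are hypothetical names, and the only detail borrowed from the thread is that group names start with umi_ or cell_.

```python
import re
import warnings


def check_barcode_groups(pattern, strict=True):
    """Hypothetical validation of named groups in a barcode regex.

    Returns True if the pattern contains a umi_* group. A missing cell_*
    group is always an error; a missing umi_* group is an error only when
    strict is True, otherwise a warning is emitted.
    """
    names = list(re.compile(pattern).groupindex)
    has_umi = any(name.startswith("umi_") for name in names)
    has_cell = any(name.startswith("cell_") for name in names)
    if not has_cell:
        raise ValueError("barcode regex does not include any cell_ groups")
    if not has_umi:
        message = "barcode regex does not include any umi_ groups"
        if strict:
            raise ValueError(message)
        warnings.warn(message)  # downgrade: report the problem but continue
    return has_umi
```

With strict=False, a cell-only pattern passes with a warning, which is the behaviour a whitelist-only workflow would need.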