CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

UMI-Tools extract feature in Alevin #588

robingarcia closed this issue 9 months ago

robingarcia commented 1 year ago

It is recommended to use Alevin instead of UMI tools. However, I cannot find the Extract feature from UMI-Tools in Alevin.

How would I write the following command in Alevin instead of UMI tools?

umi_tools extract --extract-method=regex \
    --bc-pattern='(?P<cell_1>.{16})(?P<umi_1>.{10})' \
    --stdin=Sample_S1_L004_R1_001.fastq.gz \
    --stdout=proc_Sample_S1_L004_R1_001.fastq.gz \
    --read2-in=Sample_S1_L004_R2_001.fastq.gz \
    --read2-out=proc_Sample_S1_L004_R2_001.fastq.gz \
    --whitelist=Sample_Whitelist_filt.csv \
    --filtered-out=ext_Sample_S1_L004_R1_001.fastq.gz \
    --filtered-out2=ext_Sample_S1_L004_R2_001.fastq.gz

I am grateful for any hints.

TomSmithCGAT commented 1 year ago

Hi @robingarcia. To be clear, alevin is recommended over umi_tools for quantification from scRNA-Seq data. alevin was developed specifically for this purpose and works as an 'end-to-end' solution, e.g. there is no separate step in alevin equivalent to 'extract'. The upside is that alevin provides more accurate quantification and runs much faster. The downside is that you can't operate on the output of any of the intermediate steps.
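For reference, here is a minimal Python sketch of what the extract step in the question does conceptually: it matches the barcode pattern at the start of read 1, checks the cell barcode against a whitelist, and moves the cell barcode and UMI into the read name. This is only an illustration, not the umi_tools implementation. The function name `extract_barcodes` and the example read are hypothetical, and the real tool also handles quality strings, read 2 and FASTQ comments.

```python
import re

# Same pattern as in the umi_tools command in the question:
# a 16 nt cell barcode followed by a 10 nt UMI.
BC_PATTERN = re.compile(r"(?P<cell_1>.{16})(?P<umi_1>.{10})")


def extract_barcodes(read_name, read1_seq, whitelist=None):
    """Hypothetical helper: return (new_read_name, trimmed_read1_seq),
    or None if the read would be filtered out."""
    match = BC_PATTERN.match(read1_seq)
    if match is None:
        return None  # read 1 is shorter than barcode + UMI
    cell = match.group("cell_1")
    umi = match.group("umi_1")
    if whitelist is not None and cell not in whitelist:
        return None  # cell barcode is not in the whitelist
    # The barcodes are appended to the read name as _CELL_UMI
    # and removed from the sequence.
    new_name = f"{read_name}_{cell}_{umi}"
    return new_name, read1_seq[match.end():]


# Made-up example read: 16 nt barcode + 10 nt UMI + 4 nt of remaining sequence.
example = extract_barcodes(
    "read1",
    "ACGTACGTACGTACGT" + "TTTTGGGGCC" + "AAAA",
    whitelist={"ACGTACGTACGTACGT"},
)
print(example)  # ('read1_ACGTACGTACGTACGT_TTTTGGGGCC', 'AAAA')
```

alevin performs the equivalent barcode and UMI parsing internally, which is why it has no standalone counterpart to this step.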

Looking at your barcode pattern (a 16 nt cell barcode followed by a 10 nt UMI), I guess you might be working with 10X scRNAseq data using Chromium v2 barcodes. If so, you can include the --chromium flag and alevin will handle the barcodes appropriately. See here for alevin docs.
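As a rough sketch of what the alevin invocation could look like, the snippet below assembles a `salmon alevin` command line using the raw FASTQ files from the question. The index directory, the transcript-to-gene map, the output directory and the whitelist file name are placeholders that do not appear in this thread. It also assumes the whitelist has been converted to one barcode per line and that the library type is ISR, so please check both against the alevin docs before running anything.

```python
import shlex


def build_alevin_command(read1, read2, index_dir, tx2gene, out_dir,
                         whitelist=None, threads=8):
    """Hypothetical helper: build the argument list for a salmon alevin run."""
    cmd = [
        "salmon", "alevin",
        "-l", "ISR",          # library type (assumed here)
        "-1", read1,          # read 1: cell barcode + UMI
        "-2", read2,          # read 2: cDNA sequence
        "--chromium",         # 10X Chromium v2 barcode geometry
        "-i", index_dir,      # salmon index (placeholder path)
        "-p", str(threads),
        "-o", out_dir,
        "--tgMap", tx2gene,   # transcript-to-gene map (placeholder path)
    ]
    if whitelist:
        # Optional external whitelist, assumed to be one barcode per line.
        cmd += ["--whitelist", whitelist]
    return cmd


cmd = build_alevin_command(
    "Sample_S1_L004_R1_001.fastq.gz",
    "Sample_S1_L004_R2_001.fastq.gz",
    index_dir="salmon_index",          # placeholder
    tx2gene="txp2gene.tsv",            # placeholder
    out_dir="alevin_output",           # placeholder
    whitelist="Sample_Whitelist.txt",  # placeholder, converted from the CSV
)
print(shlex.join(cmd))
```

Note that the unprocessed FASTQ files go in directly; there are no `proc_` or `ext_` intermediate files because barcode handling happens inside alevin.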

See also alevin-fry, for a more complete scRNA toolkit: