CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

about UMI error correction #489

Closed 10KGenomics closed 2 years ago

10KGenomics commented 3 years ago

Dear sir, hello! I am analysing high-throughput single-cell transcriptome data. After running umi_tools whitelist and extract and aligning with STAR, I used featureCounts to build the gene expression matrix for each cell. The results showed more than 5,000 expressed genes in a cell, but more than 700,000 UMIs, which is obviously abnormal. I suspect this is caused by sequencing or PCR errors. I would like to treat UMIs that differ by 1 bp or 2 bp as the same UMI. Can umi_tools do this, with umi_tools dedup or umi_tools group? Could you share specific instructions? I look forward to your reply. Thank you very much. Best wishes.

IanSudbery commented 3 years ago

Yes, the UMI-tools group, dedup and count commands apply an error-correcting algorithm to identify independent molecules.
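To illustrate what this kind of error correction does, here is a minimal Python sketch in the spirit of the "directional" network method described in the UMI-tools documentation. It is not the UMI-tools implementation: the function names, the `threshold` parameter and the example counts are hypothetical, and it assumes all UMIs at one position/gene have the same length. A low-count UMI is merged into a neighbour within `threshold` mismatches when the neighbour's count is at least `2 * count - 1`.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umi_counts, threshold=1):
    """Group UMIs that plausibly derive from the same molecule.

    Simplified sketch of a directional-adjacency rule (assumption: not the
    real UMI-tools code). umi_counts maps UMI sequence -> read count.
    """
    # Visit UMIs from most to least abundant so likely true UMIs seed groups.
    ordered = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    assigned = set()
    groups = []
    for seed in ordered:
        if seed in assigned:
            continue
        assigned.add(seed)
        group = [seed]
        stack = [seed]
        while stack:
            parent = stack.pop()
            for cand in ordered:
                if cand in assigned:
                    continue
                close = hamming(parent, cand) <= threshold
                # Directional rule: the parent must be clearly more abundant.
                dominant = umi_counts[parent] >= 2 * umi_counts[cand] - 1
                if close and dominant:
                    assigned.add(cand)
                    group.append(cand)
                    stack.append(cand)
        groups.append(group)
    return groups


# Hypothetical read counts for the UMIs seen at one gene in one cell.
example = {"ATAT": 100, "ATAA": 3, "TTAA": 1, "GGGG": 50, "GGGC": 40}
groups = collapse_umis(example)
print(len(groups), groups)
```

In this made-up example the five observed UMIs collapse to three inferred molecules: the two rare UMIs are absorbed by `ATAT`, while `GGGG` and `GGGC` stay separate because their counts are too similar for one to be an error of the other.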

If you are doing scRNA-seq, I'd guess you'd want to use count. The exact pipeline depends on the technology you used for the scRNA-seq. If it was 10x, then you can follow the tutorial here: https://umi-tools.readthedocs.io/en/latest/Single_cell_tutorial.html

This assumes that you are using a technology, like 10x, where all the cells are together in a single pair of fastq files, and that the reads contain UMIs and cell barcodes showing which cell each read comes from. It can be adapted for other technologies, such as inDrop, Drop-seq or CEL-seq, by changing the UMI/barcode specification. This approach is basically whitelist > extract > STAR > featureCounts > umi_tools count. However, in these cases we'd often actually recommend alevin, which uses a similar deduplication approach to UMI-tools, but is transcript-aware, much faster and less memory-demanding.
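The last step of that workflow is the one that collapses UMIs, so counting with featureCounts alone will not deduplicate. As a rough sketch, the final `umi_tools count` call can be assembled like this in Python. The flags follow the single-cell tutorial linked above and assume featureCounts was run with BAM output so that reads carry the `XT` (gene) and `XS` (assignment status) tags; the function name and file names are hypothetical placeholders.

```python
import shlex


def count_command(bam_in, counts_out):
    """Build the argv for a per-gene, per-cell umi_tools count call.

    Assumes bam_in is the sorted, indexed BAM written by featureCounts.
    """
    return [
        "umi_tools", "count",
        "--per-gene",                  # deduplicate UMIs within each gene
        "--gene-tag=XT",               # BAM tag holding the assigned gene
        "--assigned-status-tag=XS",    # BAM tag holding assignment status
        "--per-cell",                  # keep cells (barcodes) separate
        "-I", bam_in,
        "-S", counts_out,
    ]


# Hypothetical file names, for illustration only.
cmd = count_command("assigned_sorted.bam", "counts.tsv.gz")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```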

If you are using a well-by-well approach, such as SMART-seq3, where each cell comes as its own pair of fastqs, then you might be better off starting with this tutorial: https://umi-tools.readthedocs.io/en/latest/QUICK_START.html, but be aware you'll want to adapt it for paired-end sequencing by adding --paired to the dedup call. Here you would basically do extract > STAR > dedup > featureCounts.
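For the well-by-well case, the dedup step would be run once per cell. A minimal sketch of building that call in Python, with `--paired` added as described; the function name and the per-cell file names are hypothetical, and the input BAM is assumed to be coordinate-sorted and indexed.

```python
def dedup_command(bam_in, bam_out, paired=True):
    """Build the argv for umi_tools dedup on one cell's aligned BAM."""
    cmd = ["umi_tools", "dedup", "-I", bam_in, "-S", bam_out]
    if paired:
        cmd.append("--paired")  # needed for paired-end data
    return cmd


# One call per cell; these file names are made up for illustration.
cells = ["cell_A1", "cell_A2"]
commands = [dedup_command(f"{c}.sorted.bam", f"{c}.dedup.bam") for c in cells]
for c in commands:
    print(" ".join(c))
```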

IanSudbery commented 2 years ago

I'm closing this question due to lack of activity. Let me know if you have any further questions.