CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

when add the parameter --ignore-umi , why the count matrix only 1 and 0? #538

Closed omicsclass closed 2 years ago

omicsclass commented 2 years ago

When I add the parameter --ignore-umi, why does the resulting count matrix contain only 1 and 0?

umi_tools count --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell --wide-format-cell-counts -I assigned_sortedProcessed.sorted.bam -S counts.tsv.gz --ignore-umi

IanSudbery commented 2 years ago

In normal operation, UMI-tools uses two pieces of information to decide if reads are duplicates of each other: their alignment "position" and the sequence of the UMI. With --per-gene, the alignment position is the gene to which a read is aligned. Base-pair position within the gene is not taken to be relevant: most relevant techniques fragment after amplification, so duplicates of the same molecule can have different base-pair positions but will always come from the same gene. This is basically what --per-gene does.

When you use --ignore-umi (which is really only a debugging option), UMI-tools uses only the position. Since for --per-gene the position is the identity of the gene, all reads aligned to the same gene are regarded as duplicates of each other (they have the same "position", and the UMI is ignored). They are therefore all collapsed onto a single read, as long as there is at least one.

Thus every gene will have a count of either 1 or 0 (in each cell, since you are also using --per-cell).
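To make the collapsing concrete, here is a toy sketch in Python. It is not the UMI-tools implementation (which also handles UMI sequencing errors); the function `count_per_gene` and the example reads are made up for illustration. It only assumes that a molecule is identified by (cell, gene, UMI), and that ignoring the UMI reduces that to (cell, gene).

```python
from collections import defaultdict


def count_per_gene(reads, ignore_umi=False):
    """Toy model: count distinct UMIs seen for each (cell, gene) pair.

    `reads` is an iterable of (cell, gene, umi) tuples. This is a
    hypothetical helper for illustration, not part of UMI-tools.
    """
    seen = defaultdict(set)
    for cell, gene, umi in reads:
        # With ignore_umi, every read for a (cell, gene) gets the same key,
        # so they all look like duplicates of one another.
        key = "" if ignore_umi else umi
        seen[(cell, gene)].add(key)
    return {pair: len(keys) for pair, keys in seen.items()}


# Made-up example reads: (cell barcode, gene, UMI)
reads = [
    ("cellA", "geneX", "AAAA"),
    ("cellA", "geneX", "AAAA"),  # duplicate of the read above
    ("cellA", "geneX", "CCCC"),  # a second molecule of geneX
    ("cellA", "geneY", "GGGG"),
    ("cellB", "geneX", "TTTT"),
]

with_umi = count_per_gene(reads)
without_umi = count_per_gene(reads, ignore_umi=True)

print(with_umi)
print(without_umi)
```

With UMIs, geneX in cellA counts two molecules. With the UMI ignored, every (cell, gene) pair that has any read counts exactly 1, and pairs with no reads are simply absent, which shows up as 0 in a wide-format matrix.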

wangjiawen2013 commented 1 year ago

Hi, our library has barcodes but no UMIs. Can we get read counts by using --ignore-umi and dropping --per-gene?