CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
481 stars 190 forks source link

total_counts_post = total_counts_pre ?! #495

Closed YichaoOU closed 2 years ago

YichaoOU commented 2 years ago

Hello,

I'm having a problem where the total_counts_post is the same as total_counts_pre. Same problem as in #372

median_counts_pre       265879
times_observed_pre      452617
total_counts_pre       3074225
median_counts_post      271451
times_observed_post     394637
total_counts_post      3074225

My input bam looks like:

NB551526:183:H53G3BGXJ:1:22208:3291:17516_AAGTTATAC 99  chr10   137106  60  77M69S  =   137106  77  ATGCATGCACTATCCTTGTCACATAAGCTGTACTCAGTGTCAGATGCAGTGTGTACCTAGTACCGAGTGTCATATGTTAATAACGGTATAGATCGGAAGAGCACACGTCTGAACTCGAGTCACATTCCGCTATCTCGTATGCCGTC  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EAEEEEEE/EEEEEEEAEEEEEEEEAEEEEEAEEEEE<EEEEEEEE/E/EE<EEEEAE<A  NM:i:0  MD:Z:77 AS:i:77 XS:i:0
NB551526:183:H53G3BGXJ:1:22208:3291:17516_AAGTTATAC 147 chr10   137106  60  57S77M12S   =   137106  -77 CTACACAAGGAGTAAAAGTTATACTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGATGCATGCACTATCCTTGTCACATAAGCTGTACTCAGTGTCAGATGCAGTGTGTACCTAGTACCGAGTGTCATATGTTAATAACGGTAT  AA<<E<EEEEAEAEEAEEE<AEA<EEEEEEAA<EAE<EEE<<E/EAAAAEEEEEEEEEAEE<<EEEEEEEE<EEEEEEEAEEEEEEEE/EAEE<EEEEEEEA/EEEAAEEEEAEEEEEEEAEEEEEEAEEAE///EEEEAEAAAAA  NM:i:0  MD:Z:77 AS:i:77 XS:i:19

my command is: umi_tools dedup --stdin=test.st.bam --log=test.dedup.log --output-stats=test.stats.tsv --paired > test.dedup.bam

Thanks, Yichao

IanSudbery commented 2 years ago

What is happening here is that total_counts_post includes the counts of all reads that were corrected to a particluar UMI. So if we had a position where pre-dedup you had

3 x AAAA
1 x AAAT
1 x AATA

This would be corrected to:

5 x AAAA
0 x AAAT
0 x AATA

The best read with AAAA would then be the one output.

The _per_umi stats would then be

        times_observed_pre    total_count_pre    times_observed_post    total_count_post
AAAA    1                      3                  1                     5
AAAT    1                       1                  0                     0
AATA     1                      1                  0                     0   

You can see that the sum of total_counts_pre and total_counts_post are the same, as these are the "post-correction" numbers, rather than the "post deduplication" numbers. The number of positions that have a umi in the post-deduplication file is times_observed_post, as the count of each UMI at each position in the output file is always 1.

Sorry, this is confusing, and the documentation could use a rewrite to correct this.