total_counts_post = total_counts_pre ?!

CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets

MIT License

NB551526:183:H53G3BGXJ:1:22208:3291:17516_AAGTTATAC 99 chr10 137106 60 77M69S = 137106 77 ATGCATGCACTATCCTTGTCACATAAGCTGTACTCAGTGTCAGATGCAGTGTGTACCTAGTACCGAGTGTCATATGTTAATAACGGTATAGATCGGAAGAGCACACGTCTGAACTCGAGTCACATTCCGCTATCTCGTATGCCGTC AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6EAEEEEEE/EEEEEEEAEEEEEEEEAEEEEEAEEEEE<EEEEEEEE/E/EE<EEEEAE<A NM:i:0 MD:Z:77 AS:i:77 XS:i:0 NB551526:183:H53G3BGXJ:1:22208:3291:17516_AAGTTATAC 147 chr10 137106 60 57S77M12S = 137106 -77 CTACACAAGGAGTAAAAGTTATACTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGATGCATGCACTATCCTTGTCACATAAGCTGTACTCAGTGTCAGATGCAGTGTGTACCTAGTACCGAGTGTCATATGTTAATAACGGTAT AA<<E<EEEEAEAEEAEEE<AEA<EEEEEEAA<EAE<EEE<<E/EAAAAEEEEEEEEEAEE<<EEEEEEEE<EEEEEEEAEEEEEEEE/EAEE<EEEEEEEA/EEEAAEEEEAEEEEEEEAEEEEEEAEEAE///EEEEAEAAAAA NM:i:0 MD:Z:77 AS:i:77 XS:i:19

What is happening here is that total_counts_post includes the counts of all reads that were corrected to a particular UMI. So suppose that, before deduplication, a position had:

3 x AAAA
1 x AAAT
1 x AATA

This would be corrected to:

5 x AAAA
0 x AAAT
0 x AATA

The best read with UMI AAAA would then be the one read output for that position.
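To make the correction step concrete, here is a minimal sketch in Python. It is not the UMI-tools code: the function names `hamming` and `correct_umis` are hypothetical, and the merge rule (a UMI absorbs a neighbour within one mismatch when its count is at least twice the neighbour's count minus one) is a simplified stand-in for the network-based grouping that UMI-tools performs.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching bases between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def correct_umis(counts, threshold=1):
    """Map every UMI at one position to a parent UMI (hypothetical helper).

    UMIs are visited from most to least abundant. Each unassigned UMI
    becomes a parent and absorbs any unassigned UMI that is within
    `threshold` mismatches and sufficiently less abundant.
    """
    parent = {}
    for umi in sorted(counts, key=lambda u: (-counts[u], u)):
        if umi in parent:
            continue
        parent[umi] = umi
        queue = [umi]
        while queue:
            node = queue.pop()
            for other in counts:
                if other in parent:
                    continue
                close = hamming(node, other) <= threshold
                abundant = counts[node] >= 2 * counts[other] - 1
                if close and abundant:
                    parent[other] = umi
                    queue.append(other)
    return parent


# Read counts per UMI at a single position, before correction.
pre = Counter({"AAAA": 3, "AAAT": 1, "AATA": 1})

parent = correct_umis(pre)

# Reassign every read to its parent UMI to get the post-correction counts.
post = Counter({umi: 0 for umi in pre})
for umi, n in pre.items():
    post[parent[umi]] += n

print(dict(post))
```

With the counts from this example, both AAAT and AATA are one mismatch away from AAAA, so all five reads end up assigned to AAAA.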

The _per_umi stats would then be

        times_observed_pre    total_counts_pre    times_observed_post    total_counts_post
AAAA    1                     3                   1                      5
AAAT    1                     1                   0                      0
AATA    1                     1                   0                      0

You can see that the sums of total_counts_pre and total_counts_post are the same (5 in both cases), because the "post" columns are "post-correction" numbers rather than "post-deduplication" numbers. The number of positions at which a UMI appears in the post-deduplication file is times_observed_post, as the count of each UMI at each position in the output file is always 1.
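The arithmetic can be checked with a short Python sketch. This is an illustration only, not the UMI-tools implementation: `per_umi_stats` is a hypothetical helper, and the input is assumed to be one dictionary of UMI read counts per position, given once before and once after correction.

```python
from collections import defaultdict


def per_umi_stats(positions):
    """Summarise UMI counts across positions (hypothetical helper).

    `positions` is a list with one dict per position, mapping
    UMI -> number of reads assigned to that UMI at that position.
    """
    stats = defaultdict(lambda: {"times_observed": 0, "total_counts": 0})
    for counts in positions:
        for umi, n in counts.items():
            if n > 0:
                stats[umi]["times_observed"] += 1
                stats[umi]["total_counts"] += n
    return stats


# One position, using the numbers from the example in this thread.
pre_positions = [{"AAAA": 3, "AAAT": 1, "AATA": 1}]
post_positions = [{"AAAA": 5, "AAAT": 0, "AATA": 0}]

pre_stats = per_umi_stats(pre_positions)
post_stats = per_umi_stats(post_positions)

umis = sorted(pre_positions[0])
for umi in umis:
    print(
        umi,
        pre_stats[umi]["times_observed"],
        pre_stats[umi]["total_counts"],
        post_stats[umi]["times_observed"],
        post_stats[umi]["total_counts"],
    )

# Correction moves reads between UMIs but never discards them,
# so the column totals agree.
total_pre = sum(pre_stats[u]["total_counts"] for u in umis)
total_post = sum(post_stats[u]["total_counts"] for u in umis)

# Number of reads actually written at this position after deduplication:
# one per UMI that is still observed.
reads_output = sum(post_stats[u]["times_observed"] for u in umis)
```

Here `total_pre` and `total_post` are both 5, while `reads_output` is 1, which is the figure that reflects the deduplicated file.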

Sorry, this is confusing, and the documentation could use a rewrite to correct this.

total_counts_post = total_counts_pre ?! #495