Closed YichaoOU closed 2 years ago
What is happening here is that total_counts_post includes the counts of all reads that were corrected to a particular UMI. So if, before deduplication, you had a position with:
3 x AAAA
1 x AAAT
1 x AATA
This would be corrected to:
5 x AAAA
0 x AAAT
0 x AATA
The best read with AAAA would then be the one output.
The _per_umi stats would then be:
UMI   times_observed_pre  total_count_pre  times_observed_post  total_count_post
AAAA  1                   3                1                    5
AAAT  1                   1                0                    0
AATA  1                   1                0                    0
You can see that the sums of total_counts_pre and total_counts_post are the same, because these are the "post-correction" numbers rather than the "post-deduplication" numbers. The number of positions that have a given UMI in the post-deduplication file is times_observed_post, since the count of each UMI at each position in the output file is always 1.
Sorry, this is confusing, and the documentation could use a rewrite to correct this.
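To make the bookkeeping concrete, here is a minimal Python sketch that reproduces the numbers in the example above. It is not the UMI-tools implementation: the function per_umi_stats, the position label "pos1", and the hand-written corrected_to mapping are all hypothetical, and the column names follow the ones used in the text.

```python
from collections import defaultdict

# Hypothetical toy input: one position, UMI -> read count before deduplication.
pre_counts = {"pos1": {"AAAA": 3, "AAAT": 1, "AATA": 1}}

# Assumed result of UMI error correction at that position, copied from the
# example above rather than computed: each UMI maps to the UMI it was merged into.
corrected_to = {"pos1": {"AAAA": "AAAA", "AAAT": "AAAA", "AATA": "AAAA"}}


def per_umi_stats(pre_counts, corrected_to):
    """Tally per-UMI counts before and after correction (illustrative only)."""
    stats = defaultdict(lambda: {
        "times_observed_pre": 0,
        "total_counts_pre": 0,
        "times_observed_post": 0,
        "total_counts_post": 0,
    })
    for pos, umis in pre_counts.items():
        post = defaultdict(int)
        for umi, n_reads in umis.items():
            # "pre" columns: the UMI as it was originally read.
            stats[umi]["times_observed_pre"] += 1
            stats[umi]["total_counts_pre"] += n_reads
            # Reads move to the corrected UMI; none are dropped at this stage.
            post[corrected_to[pos][umi]] += n_reads
        for umi, n_reads in post.items():
            # "post" columns: counts after correction, not after deduplication.
            stats[umi]["times_observed_post"] += 1
            stats[umi]["total_counts_post"] += n_reads
    return dict(stats)


stats = per_umi_stats(pre_counts, corrected_to)
for umi in sorted(stats):
    print(umi, stats[umi])
```

Because correction only moves reads from one UMI to another, the two total_counts columns always sum to the same value. The number of reads kept after deduplication corresponds to the sum of times_observed_post instead.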
Hello,
I'm having a problem where total_counts_post is the same as total_counts_pre. Same problem as in #372.
My input BAM looks like:
My command is:
umi_tools dedup --stdin=test.st.bam --log=test.dedup.log --output-stats=test.stats.tsv --paired > test.dedup.bam
Thanks, Yichao