group stats differ by large amount with use of --ignore-tlen in --paired mode

ijhoskins commented 2 years ago

Hello,

I am running the group command (version 1.1.2) on paired reads locally aligned with bowtie2.

umi_tools group -I reads.bam --paired --no-sort-output --edit-distance-threshold=1 --umi-separator=_ --unpaired-reads=discard --unmapped-reads=discard --multimapping-detection-method X0

I noticed that when I use the --ignore-tlen flag in addition to the --paired flag, the stats reported with --log2stderr differ greatly:

--paired:

    2022-02-28 06:16:11,618 INFO Reads: Input Reads: 52311643, Read pairs: 52311643, Read 2 unmapped: 182032, Read 1 unmapped: 42265
    2022-02-28 06:16:11,618 INFO Number of reads out: 104216957, Number of groups: 8958952
    2022-02-28 06:16:11,618 INFO Total number of positions deduplicated: 99805
    2022-02-28 06:16:11,618 INFO Mean number of unique UMIs per position: 365.86
    2022-02-28 06:16:11,618 INFO Max. number of unique UMIs per position: 49938

--paired and --ignore-tlen:

    2022-02-24 07:44:42,708 INFO Reads: Input Reads: 52311643, Read pairs: 52311643, Read 2 unmapped: 182032, Read 1 unmapped: 42265
    2022-02-24 07:44:42,708 INFO Number of reads out: 104216957, Number of groups: 7272088
    2022-02-24 07:44:42,708 INFO Total number of positions deduplicated: 4731
    2022-02-24 07:44:42,708 INFO Mean number of unique UMIs per position: 7512.49
    2022-02-24 07:44:42,708 INFO Max. number of unique UMIs per position: 50298

The number of groups and max UMIs per position make sense to me, but the following stats show large discrepancies:

Total number of positions deduplicated
Mean number of unique UMIs per position

I would expect these stats to be the same. In addition, there are far fewer than 99805 positions in my reference (a single transcript). Does this mean the "positions" stat enumerates position-tlen combinations?

Thank you for any clarification you could provide!

IanSudbery commented 2 years ago

In terms of deduplicating reads, --ignore-tlen effectively means that reads are grouped together as though they were single-end: only information from read1 is used to determine the uniqueness of the read. So for example:

     pair1-read1|>>>>>>>>>>|                                   |<<<<<<<<<<<<<|pair1-read2
     pair2-read1|>>>>>>>>>>|                                               |<<<<<<<<<<<<<|pair2-read2

would be regarded as mapping to different locations with --paired, but to the same location with --ignore-tlen --paired.

You are correct in your second supposition: the number of positions is the number of position-tlen combinations. It would be better labeled "read bundles", which are the basic collections of reads to which the UMI algorithm is applied.
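To make the effect on the stats concrete, here is a minimal sketch of how dropping tlen from the bundle key merges bundles. It is not the UMI-tools implementation: the record layout, the `bundle` function, and the key contents are assumptions for illustration only (the real key may include further fields).

```python
from collections import defaultdict

# Hypothetical read-pair records: (read1 5' position, strand, tlen, UMI).
pairs = [
    (100, "+", 250, "ACGT"),  # pair1
    (100, "+", 300, "ACGT"),  # pair2: same read1 start, different tlen
    (100, "+", 250, "TTGA"),  # pair3: same location as pair1, different UMI
]

def bundle(pairs, ignore_tlen=False):
    """Collect the unique UMIs seen under each bundle key (illustrative only)."""
    bundles = defaultdict(set)
    for pos, strand, tlen, umi in pairs:
        # With ignore_tlen, only read1 information defines the location.
        key = (pos, strand) if ignore_tlen else (pos, strand, tlen)
        bundles[key].add(umi)
    return dict(bundles)

with_tlen = bundle(pairs)
without_tlen = bundle(pairs, ignore_tlen=True)

for label, result in (("paired", with_tlen), ("paired + ignore-tlen", without_tlen)):
    n_bundles = len(result)
    mean_umis = sum(len(umis) for umis in result.values()) / n_bundles
    print(label, n_bundles, mean_umis)
```

In this toy input, ignoring tlen collapses two bundles into one and raises the mean number of unique UMIs per bundle, which is the same direction of change as in the logs above.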

ijhoskins commented 2 years ago

Hi @IanSudbery, thank you for your response and the clarification about the stats. Also, is the position used for grouping always the 5' end of R1 (i.e. the rightmost mapped position for R1 on the - strand)?

IanSudbery commented 2 years ago

The position is the 5'-most end of R1, relative to the orientation of the read, not the genome. So if the read is on the +ve strand, then it is the right most base, and if it's on the -ve strand, it's the left most. Note that it's the 5'-most base of the read, not the 5'-most base of the alignment, so if the read is 5' soft-clipped, the soft-clipping is effectively undone. There is a notebook somewhere (I believe in the repo that went along with the paper) that demonstrates that this captures a higher proportion of duplicates.

ijhoskins commented 2 years ago

Excellent, thank you for the added detail. For posterity, I believe there was a small typo in your statement above, which should read: "if the read is on the +ve strand, then it is the left most base, and if it's on the -ve strand, it's the right most".
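For anyone who wants to check positions by hand, the corrected rule can be sketched as follows. This is an illustration rather than the UMI-tools code: `five_prime_position` is a hypothetical helper, it assumes a mapped read with a 0-based leftmost aligned coordinate and a SAM CIGAR string, it reports the reverse-strand position as an inclusive 0-based coordinate, and it ignores hard clips.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference

def five_prime_position(pos, cigar, is_reverse):
    """Return the 5' position of the read with 5' soft-clipping undone.

    pos is the 0-based leftmost aligned reference coordinate (assumption).
    """
    # Parse the CIGAR string and drop hard clips, which this sketch ignores.
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar) if op != "H"]
    if not is_reverse:
        # + strand: the 5' end is the leftmost base; extend left over a leading soft clip.
        leading_clip = ops[0][0] if ops[0][1] == "S" else 0
        return pos - leading_clip
    # - strand: the 5' end is the rightmost base; extend right over a trailing soft clip.
    ref_length = sum(n for n, op in ops if op in REF_CONSUMING)
    trailing_clip = ops[-1][0] if ops[-1][1] == "S" else 0
    return pos + ref_length - 1 + trailing_clip

print(five_prime_position(100, "5S45M", is_reverse=False))  # 95
print(five_prime_position(100, "45M5S", is_reverse=True))   # 149
```

Under these assumptions, a soft-clipped read and an unclipped read covering the same fragment end resolve to the same position (95 for `5S45M` at 100 and for `50M` at 95 on the + strand), which is why undoing the clipping groups more duplicates together.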

CGATOxford / UMI-tools

group stats differ by large amount with use of --ignore-tlen in --paired mode #517