GroupReadsByUmi groups reads with or without the same end.

jeanquassi commented 4 months ago

Hi, I am working on amplicon sequencing data with UMIs. After grouping with GroupReadsByUmi (strategy=identity), I get, among other UMI families, the two below (showing a couple of the reads from each):

UMI family 2160

UMI family 4716

As I understand the grouping method: the 2 families map to the same amplicon, they have the same UMI sequence, and all the reads start at the same position. The only difference seems to be their end position, and that is why they are separated into 2 families. If the above assumption is true, why is the 3rd read in family 4716 (0074-01718) grouped there, since it has a length of only 42 compared with 138 for the other reads?

I am new to GroupReadsByUmi and have read the official documentation page, but I might have misunderstood the grouping method. Thank you in advance for your help.

nh13 commented 4 months ago

If reads are from the same source molecule, then the outer ends of the template are assumed to be the same. Since you're not using paired-end reads, the outer ends are the start and end of each read. Are you doing any up-front quality or adapter trimming prior to grouping that would make the read lengths differ?
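
To make the assumption above concrete, here is a minimal toy sketch in Python. It is not fgbio's implementation; it only models the idea described in this comment, that single-end reads are expected to share both outer ends and the UMI. The names `Read` and `group_by_ends_and_umi`, the contig name, the UMI and all coordinates are hypothetical and chosen only so that the read lengths are 138 and 42.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical, simplified single-end read record (not an fgbio type)."""
    name: str
    contig: str
    start: int  # 1-based alignment start
    end: int    # 1-based inclusive alignment end
    umi: str


def group_by_ends_and_umi(reads):
    """Toy identity-style grouping: exact match on contig, start, end and UMI."""
    families = defaultdict(list)
    for read in reads:
        key = (read.contig, read.start, read.end, read.umi)
        families[key].append(read.name)
    return dict(families)


# Made-up example data: r1 and r2 span 138 bases, r3 was trimmed to 42 bases.
reads = [
    Read("r1", "amplicon1", 100, 237, "ACGTACGT"),
    Read("r2", "amplicon1", 100, 237, "ACGTACGT"),
    Read("r3", "amplicon1", 100, 141, "ACGTACGT"),
]

families = group_by_ends_and_umi(reads)
for key, names in families.items():
    print(key, names)
```

In this toy model, trimming changes a read's end coordinate and therefore its grouping key, so the trimmed read falls into a separate family even though its start and UMI match the others. Whether the real tool keys on both ends for single-end data should be checked against the fgbio documentation.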

jeanquassi commented 4 months ago

Hi Nils, thank you for your swift answer. Indeed, we do some adapter trimming before the grouping. I will check the quality of the trimming for the strange reads.

Thank you again for your help.

fulcrumgenomics / fgbio

GroupReadsByUmi groups reads with or without the same end. #979