ARUP-NGS / BMFtools

Barcoded Molecular Families
MIT License

Truncated consensus sequence #104

Open Benni96 opened 7 years ago

Benni96 commented 7 years ago

Hi, I collapsed amplicon data and got some truncated reads during the collapse step.

The data were paired-end reads that were stitched into single reads. The stitched reads were then collapsed with the UMI inline.

bmftools collapse inline -S -l 10 -s <homing> -f <prefix> -z <stitched reads>

After mapping, I observed some reads that did not span the entire amplicon region. I looked up one such read in the UMI file and in the stitched reads file. The "original" stitched read file contained 12900 reads with the UMI; 99.9% were full length and only 10 were shorter. However, even the shortest of those reads was still longer than the read in the UMI read file.

UMI: GCATCCACAAAT
Stitched reads with this UMI: 12963

Read length distribution (count / length):
1 / 96
1 / 129
1 / 130
10 / 131
161 / 132
12787 / 133
2 / 134

Length of the consensus of the UMI family: 69 bp

The homing sequence is 3 nt and the barcode is 10 nt. Therefore, even the shortest read (96 nt) should result in a consensus read longer than 69 bp.

Do you have any suggestions? Or was this observed before?
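
For anyone wanting to reproduce the per-UMI length tabulation above, here is a minimal Python sketch. It is not part of BMFtools: the helper names are hypothetical, and it assumes a plain 4-line FASTQ with the UMI at the very start of each stitched read. The last helper assumes the consensus should simply lose the barcode and homing bases, which is the expectation stated in the post, not something confirmed by the BMFtools source.

```python
import io
from collections import Counter


def read_fastq_sequences(handle):
    """Yield the sequence line of each record (assumes 4-line FASTQ records)."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def length_distribution(sequences, umi):
    """Count read lengths among sequences that start with the given UMI."""
    return Counter(len(seq) for seq in sequences if seq.startswith(umi))


def expected_insert_length(read_len, barcode_len=10, homing_len=3):
    """Length left after removing barcode and homing bases (an assumption)."""
    return read_len - barcode_len - homing_len


# Toy data standing in for the stitched reads file; "ACGT" plays the UMI.
toy_fastq = io.StringIO(
    "@r1\nACGTAAAA\n+\nIIIIIIII\n"
    "@r2\nACGTAA\n+\nIIIIII\n"
    "@r3\nTTTTAAAA\n+\nIIIIIIII\n"
)
dist = length_distribution(read_fastq_sequences(toy_fastq), "ACGT")
for length, count in sorted(dist.items()):
    print(count, length)  # same "count length" layout as in the post
```

Under that assumption, expected_insert_length(96) gives 83, so a 69 bp consensus is shorter than even the shortest input read would explain.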

dnbaker commented 7 years ago

Were your input reads all of uniform read length? I'm surprised by this behavior; would you be willing to provide some data with which I can reproduce the issue?

Thank you!

Benni96 commented 7 years ago

Hi, the read length varies a bit (+/- 2 bp), as you can see in my previous post. I have also observed this phenomenon at a low level in other datasets. By accident, I tried the UMI generation with the option "-n 5", and then the consensus reads were correct. However, I have no clue why this option changes the output. Do you?

dnbaker commented 7 years ago

BMFtools assumes uniform read length, which is why adapter masking, not trimming, is suggested.

Are you using Illumina data?

-n only changes memory requirements.

How many reads passed homing sequence?
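
To illustrate the masking versus trimming point above: trimming shortens reads that contain adapter sequence, while masking overwrites the adapter bases with N and keeps every read at its original length. The Python sketch below is only an illustration with hypothetical helper names and toy sequences. It uses exact string matching to find the adapter, which real adapter tools do not rely on, and it is not how BMFtools or any particular trimmer is implemented.

```python
def trim_adapter(seq, adapter):
    """Cut the read at the first exact adapter match (read gets shorter)."""
    idx = seq.find(adapter)
    return seq if idx == -1 else seq[:idx]


def mask_adapter(seq, adapter, mask_char="N"):
    """Replace everything from the adapter onward with N (length preserved)."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq
    return seq[:idx] + mask_char * (len(seq) - idx)


def has_uniform_length(seqs):
    """True if all sequences share one length."""
    return len({len(s) for s in seqs}) <= 1


# Toy reads of equal length; the first one runs into a made-up adapter.
adapter = "AGATCGGA"
reads = ["ACGTACGTAGATCGGA", "ACGTACGTACGTACGT"]

trimmed = [trim_adapter(r, adapter) for r in reads]
masked = [mask_adapter(r, adapter) for r in reads]

print(has_uniform_length(trimmed))  # lengths now differ
print(has_uniform_length(masked))   # lengths unchanged
```

A check like has_uniform_length on the stitched reads is a quick way to see whether the input meets the uniform-length assumption before collapsing.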