PengfanZhang / Rbec

The exact match #8

Closed yjiakang closed 6 months ago

yjiakang commented 6 months ago

Sorry to bother you. Does "exact match" mean that the ref seq and the read are 100% identical?

PengfanZhang commented 6 months ago

Exactly. An exact match means 100% identity and 100% coverage.

yjiakang commented 6 months ago

Thanks for your quick reply. So if the ref seq is 2 bp longer than the reads, it will not get any initial counts?

PengfanZhang commented 6 months ago

No, it will not. That is why truncating the reference sequences in the database is recommended. If you can't be sure that the reference sequence is 100% accurate, you can check the contamination_seq.fna file to see which potential contaminant sequence is closest to the reference that is missing.
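
To illustrate the check suggested above, here is a minimal Python sketch that ranks the sequences in a FASTA file by similarity to a reference. It is not part of Rbec: the helper names `read_fasta` and `closest_sequences` are hypothetical, it assumes contamination_seq.fna is an ordinary FASTA file, and it uses the similarity ratio from Python's `difflib` as a rough stand-in for a proper alignment.

```python
from difflib import SequenceMatcher


def read_fasta(path):
    """Parse a FASTA file into a dict of {header: sequence}."""
    records, header, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new record starts; store the previous one first.
                if header is not None:
                    records[header] = "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def closest_sequences(reference, candidates, top=3):
    """Return the `top` (similarity, name) pairs, most similar first."""
    scored = [
        (SequenceMatcher(None, reference, seq, autojunk=False).ratio(), name)
        for name, seq in candidates.items()
    ]
    scored.sort(reverse=True)
    return scored[:top]
```

A call such as `closest_sequences(ref_seq, read_fasta("contamination_seq.fna"))` would then list the candidates most similar to the reference that received no counts. A candidate that differs from it by only a few bases hints that the reference itself may need correcting or truncating.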

yjiakang commented 6 months ago

If the ref seq is 2 bp longer than the reads, will it influence the results much? Thanks for your patience.

PengfanZhang commented 6 months ago

If the ref seq does not have exactly the same length and the same sequence as at least one of the input amplicon reads, Rbec cannot calculate the abundance for that reference sequence. Two suggestions for your case:

1) If the extra 2 bp are at either end of the sequence, you can simply truncate them before running Rbec.
2) If the 2 bp insertion is in the middle of the reference, then either the reference strain is not in your community (or it is evolving, although mutations in 16S or other marker genes are unlikely on a short evolutionary time scale), or your reference sequence contains errors, given that a read observed multiple times in the amplicon sequencing output should be accurate.
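
The exact-match rule and the effect of truncation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Rbec's actual implementation: the function names and the toy sequences are made up for the example.

```python
from collections import Counter


def exact_match_counts(references, reads):
    """Count reads identical to each reference (same length, same bases)."""
    read_counts = Counter(reads)
    return {name: read_counts.get(seq, 0) for name, seq in references.items()}


def truncate_ends(seq, left=0, right=0):
    """Remove `left` bases from the start and `right` bases from the end."""
    return seq[left:len(seq) - right]


# Toy data: the reference carries 2 extra bp at its 3' end.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"] * 2
references = {"strain_A": "ACGTACGTGG"}

print(exact_match_counts(references, reads))  # {'strain_A': 0}

# After trimming the 2 extra bp, the reference matches five reads exactly.
trimmed = {name: truncate_ends(seq, right=2) for name, seq in references.items()}
print(exact_match_counts(trimmed, reads))  # {'strain_A': 5}
```

An insertion in the middle of the reference cannot be fixed this way, because trimming the ends never makes the two sequences identical.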