Are searches more vulnerable to mismatches in their center?

Laura-Alex commented 2 days ago

This is just a conceptual question I'd like to figure out in order to better plan my future searches.

Assume we have a query of length 100 and, for simplicity, that no 31-mer within the sequence is repeated.

If a single nucleotide mismatch with the query exists in the center of a subject sequence, it has the potential to cause a mismatch with 31 31-mers. Am I understanding this correctly? And since the total number of 31-mers in the 100bp sequence is ~69 (I think?), that would bring the kmer coverage to ~0.55.

However, a kmer coverage of ~0.55 could also be achieved if there were 30 mismatches clustered to one side?

I've tried to wrap my head around it with Excel and a simplified 10bp query with 3-mers (blue are matches; red are mismatches), but I'm not sure I'm understanding this correctly. Help would be appreciated!

pierrepeterlongo commented 1 day ago

Hi @Laura-Alex

Thanks for your message. You're right: one mismatch in the middle of a sequence (that is, at least $k-1$ positions away from either end) "generates" $k$ absent kmers. A sequence of length $l$ contains $l-k+1$ kmers, so in this case the ratio of shared kmers (called kmer_coverage in https://logan-search.org/) is $(l-2k+1) / (l-k+1)$. So indeed, with $l=100$ and $k=31$, one "middle mismatch" leads to a kmer_coverage of $39/70 \approx 0.56$. A mismatch closer to an end is covered by fewer kmers, so it removes fewer of them.

Note that for a sequence of length 1000, this kmer_coverage with one substitution is $\approx0.97$.
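To make the position dependence concrete, here is a minimal Python sketch that counts the k-mers left intact for a given set of mismatch positions. The function name `kmer_coverage` and the 0-based positions are illustrative assumptions, not part of Logan Search, and the sketch assumes (as in the question) that all k-mers in the query are distinct and that only substitutions occur.

```python
def kmer_coverage(length, k, mismatch_positions):
    """Fraction of the query's k-mers that contain no mismatch.

    Hypothetical helper for illustration only (not a Logan Search API).
    Positions are 0-based; all k-mers are assumed to be distinct.
    """
    n_kmers = length - k + 1
    mismatches = set(mismatch_positions)
    # A k-mer starting at `start` covers positions start .. start + k - 1.
    intact = sum(
        1
        for start in range(n_kmers)
        if not any(start <= p < start + k for p in mismatches)
    )
    return intact / n_kmers


# One substitution in the center of a 100 bp query: 31 k-mers are lost.
print(round(kmer_coverage(100, 31, [50]), 2))        # 0.56
# One substitution on the very first base: only 1 k-mer is lost.
print(round(kmer_coverage(100, 31, [0]), 2))         # 0.99
# 31 substitutions clustered at the start: the same 31 k-mers are lost.
print(round(kmer_coverage(100, 31, range(31)), 2))   # 0.56
# One central substitution in a 1000 bp query.
print(round(kmer_coverage(1000, 31, [500]), 2))      # 0.97
```

Under these assumptions, a single central substitution and a run of 31 substitutions at one end give the same kmer_coverage, because both break exactly 31 of the 70 k-mers.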

I hope this helps. Pierre

Laura-Alex commented 1 day ago

Thank you!

IndexThePlanet / LoganSearch