Hi @Laura-Alex
Thanks for your message. You're right: one mismatch in the middle of a sequence "generates" $k$ absent kmers. A sequence of length $l$ contains $l-k+1$ kmers, so 70 (not 69) for $l=100$ and $k=31$. When the single mismatch lies at least $k-1$ positions away from both ends of the sequence, the ratio of shared kmers (called kmer_coverage in https://logan-search.org/) is $(l-2k+1) / (l-k+1)$. So indeed, with $l=100$ and $k=31$, one "middle mismatch" leads to a kmer_coverage of $39/70 \approx 0.56$.
Note that for a sequence of length 1000, this kmer_coverage with one substitution is $\approx0.97$.
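To check these numbers by hand, here is a minimal Python sketch that counts the kmers left untouched by a set of mismatch positions. The function name `kmer_coverage` and its arguments are made up for illustration; this is not the Logan Search implementation. It assumes that all kmers in the query are distinct and that a kmer counts as shared only on an exact match.

```python
def kmer_coverage(length, k, mismatch_positions):
    """Fraction of the query's k-mers that contain no mismatched position.

    Positions are 0-based. Illustrative assumptions: every k-mer in the
    query is distinct, and a k-mer is shared only if it matches exactly.
    """
    total = length - k + 1  # number of k-mers in a sequence of this length
    if total <= 0:
        raise ValueError("sequence is shorter than k")
    mismatches = set(mismatch_positions)
    # A k-mer starting at `start` covers positions start .. start + k - 1.
    shared = sum(
        1
        for start in range(total)
        if not any(start <= p < start + k for p in mismatches)
    )
    return shared / total


# One mismatch in the middle of a 100 bp query: 39 of 70 k-mers remain.
print(round(kmer_coverage(100, 31, [50]), 2))       # 0.56
# Same single mismatch in a 1000 bp query: 939 of 970 k-mers remain.
print(round(kmer_coverage(1000, 31, [500]), 2))     # 0.97
# 30 mismatches clustered at one end of a 100 bp query: 40 of 70 remain.
print(round(kmer_coverage(100, 31, range(30)), 2))  # 0.57
```

The last call shows that 30 mismatches clustered at one end give almost the same kmer_coverage as a single central mismatch, so the value alone does not tell the two situations apart.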
I hope this helps. Pierre
Thank you!
This is just a conceptual question I'd like to figure out in order to better plan my future searches.
Assume we have a query of length 100 and, for simplicity, that no 31-mer within the sequence is repeated.
If a single nucleotide mismatch with the query exists in the center of a subject sequence, it has the potential to cause a mismatch with 31 31-mers. Am I understanding this correctly? And since the total number of 31-mers in the 100bp sequence is ~69 (I think?), that would bring the kmer coverage to ~0.55.
However, a kmer coverage of ~0.55 could also be achieved if there were 30 mismatches clustered to one side?
I've tried to wrap my head around it with Excel and a simplified 10bp query with 3-mers (blue are matches; red are mismatches), but I'm not sure I'm understanding this correctly. Help would be appreciated!