eric9n / Kun-peng

Kun-peng: an ultra-fast, low-memory footprint and accurate taxonomy classifier for all
MIT License
10 stars 3 forks source link

open syncmer instead of minimzer? #25

Closed jianshu93 closed 3 weeks ago

jianshu93 commented 1 month ago

Hi @eric9n,

I open an additional issue here about using a different sketching algorithm, in addition to minimizer. I attached slide explaining the differences between minimizer and open syncmer and also the paper on open syncmer (https://peerj.com/articles/10805/).

Ideally, the syncmer/minimizer extracted, are just kmer from the genome and this will not affect how kmers will be compressed in a compacted hash table, just what kmers from genomes will be extracted is different.

The implementation should be straitforward. and I will point to some existing ones. Let me know your thoughts on the open syncmer idea.

Thanks,

Jianshu open_syncmer_kun_peng.pdf

eric9n commented 1 month ago

I think it might have advantages for specific databases, but it would be difficult to make it universally applicable.

jianshu93 commented 4 weeks ago

Actually open syncmer has been proved theoretically to be efficient in the case for reads mapping, e.g., in minimap2 for all cases (if open syncmers are chosen as seeds/anchors and then chain the seed and extend), I attached a paper where the prove based on minimizer is extremely difficult to obtain while for open syncmer it is much easier: https://genome.cshlp.org/content/33/7/1175.short

It seems for other genomes, human for example, it is slower due to repeated kmers, but for bacteria/archaea/virus genomes, not a problem all since those genomes are not multi-copy.

Thanks,

Jianshu

eric9n commented 4 weeks ago

In the paper you provided, the s-mer lengths for syncmers are generally below 10. Mathematically, this translates to about 2^22 (assuming a 4-letter alphabet), which wouldn't allow for the construction of a very large database.

Different methods are needed to build the database. This approach is not suitable for building a database for Kun Peng.