use phthash (perfect minimal hash function) for Kraken 2?

RagnarGrootKoerkamp / PTRHash

PTRHash minimal perfect hash function, based of PTHash

33 stars 6 forks source link

use phthash (perfect minimal hash function) for Kraken 2? #10

Closed jianshu93 closed 1 month ago

jianshu93 commented 1 month ago

Dear @RagnarGrootKoerkamp,

Just what to hear your thoughts on perfect minimal hash function for a software called Kraken 2 (widely cited tool to classify metagenomic reads agains a database, see here https://link.springer.com/article/10.1186/s13059-019-1891-0). Essentially, a minimizer was extracted and then stored in a compressed hash table. They use some strategy to compress the hash table in Kraken 2 VS. kraken 1. This reduces accuracy (more collisions) but improve space (hash table is much smaller). I am wondering whether PTHash will be very useful in this case, both reduce collision (improve accuracy) and running time (external memory processing & efficient multithreading).

Best,

Jianshu

RagnarGrootKoerkamp commented 1 month ago

Hey!

The main limitation of perfect hashing is that these hash functions only work if the set of objects (kmers, or rather their minimizers) is static and known up front. Is that the case for Kraken? If so, yes, a perfect hash function could be used.

There is PTHash and the just out version Phobic, but also feel free to use PtrHash.

How large are the datasets? All these hash tables are 2-3 bits/element, which is very small compared to actually storing the kmers/minimizers themselves, but can indeed still be large for very large datasets.

jianshu93 commented 1 month ago

hi Ragnar,

the database for kraken is a fixed database (e.g., 100,000 genomes), genome that are there. Then extract minimizer from each genome for a give K, e.g., 31, and store the minimizer via a hash table. Then for query, do the same minimizer step and search the minimizer from the query genome in the hash table. Steps after are not related to hash table, just know which minimizer from query genome are also found for a database genome. Does this make sense to you?

Jianshu

RagnarGrootKoerkamp commented 1 month ago

Hmmm, that's tricky. The thing with perfect hash functions is that one can/should only query keys/minimizers that are actually in the hash table. When querying, it can happen that the query contains a key/minimizer that wasn't inserted, but since the MPHF does not store the actual inserted keys, it can not do collision detection, and will return a 'false positive' index, even though the inserted key actually does not exist.

This may be fine in certain cases, for example when you can verify whether the minimizer exists in another way, e.g. by cross checking with the reference. That's basically what LPHash and SSHash implement, so consider having a look at those.

Otherwise, you're out of luck and will have to use a more standard hash table, or e.g. something like CBL instead.

jianshu93 commented 1 month ago

Thanks for those comments, very helpful.

Jianshu