GuillaumeHolley / BloomFilterTrie

An alignment-free, reference-free and incremental data structure for colored de Bruijn graph with application to pan-genome indexing.
MIT License

False-positive kmer hits #14

Open masakistan opened 6 years ago

masakistan commented 6 years ago

Thanks for the great software package. It's been great to use.

It is my understanding that the BFT shouldn't return false-positive kmer hits. Is this correct?

If so, I think I've come across a bug. I'm adding only canonical kmers from two different genomes, with k=90, to a BFT. When I query a particular kmer, I get a hit saying that it appears in both genomes, but when I grep for the kmer in the kmer count dumps, it appears in only one of the genomes.

The data files can be accessed here. The file contains:

The problematic kmer is AAAGAAAAAGGGGAAGAAATGGGCGAGGTAGCAAACGTAAATGAAATTCCGGTCAAGATAAGAAATCATAAGTATCCTGCGAAAGAACAT
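One thing worth ruling out before treating this as a false positive: a plain grep only matches the orientation that was typed, so a kmer stored in a dump as its reverse complement would be missed. The sketch below is not part of BFT. It assumes the common convention that the canonical form is the lexicographically smaller of a kmer and its reverse complement, and it assumes a hypothetical dump format with one `KMER COUNT` pair per line. The function names are illustrative only.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Assumed convention: the lexicographically smaller of the two strands."""
    return min(kmer, reverse_complement(kmer))


def genomes_containing(kmer: str, dumps: dict) -> set:
    """Return the names of the dumps that contain `kmer` in either orientation.

    `dumps` maps a genome name to an iterable of lines; each line is assumed
    to look like "KMER COUNT" (hypothetical format, adjust to the real one).
    """
    target = canonical(kmer.upper())
    hits = set()
    for name, lines in dumps.items():
        for line in lines:
            fields = line.split()
            # Compare canonical forms so strand orientation does not matter.
            if fields and canonical(fields[0].upper()) == target:
                hits.add(name)
                break
    return hits
```

With real data, each value in `dumps` could be an open file handle, for example `{"genome_a": open("a_dump.txt")}` (the file name is a placeholder). If this check reports only one genome while the BFT query reports two, the mismatch is not explained by strand orientation.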

Thanks for your help! Please let me know if you require any additional information or if I can do anything.

GuillaumeHolley commented 6 years ago

Hey Stanley,

Thanks for opening this issue, and sorry for answering so late (I am not actively working on the project). I will look into the matter as soon as I can ;)

Guillaume

masakistan commented 6 years ago

No worries, I appreciate the response and the great software package.

Please let me know if there is anything I can do to help.