dkoslicki / CMash

Fast and accurate set similarity estimation via containment min hash
BSD 3-Clause "New" or "Revised" License

Multiple k-mer sizes confirmation and testing #20

Open dkoslicki opened 4 years ago

dkoslicki commented 4 years ago

Definitions:

"new method" = use a very large k-mer size, put the k-mers in a ternary search trie, and use prefix matches to infer the containment values for smaller k-mer sizes.

"old method" = train and re-run CMash separately on each individual k-mer size.
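To make the two definitions concrete, here is a minimal sketch of the prefix idea. It is not CMash's implementation: it uses plain Python sets where the real code uses a ternary search trie, it skips min-hash sketching and reverse complements, and the function names (`kmers`, `truncated_containment`) are hypothetical. It only illustrates how k-mers stored once at the largest size can be truncated to answer containment queries at smaller sizes.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncated_containment(reference_kmers_max, query_seq, k):
    """Estimate containment of the reference in the query at size k.

    reference_kmers_max holds k-mers of the largest size only; the
    size-k view is obtained by keeping each k-mer's length-k prefix
    (the "new method"), rather than re-enumerating the reference at
    size k (the "old method").
    """
    prefixes = {kmer[:k] for kmer in reference_kmers_max}
    if not prefixes:
        return 0.0
    query_kmers = kmers(query_seq, k)
    return len(prefixes & query_kmers) / len(prefixes)


# Toy sequences (made up for illustration only).
reference = "ACGTACGTGG"
query = "ACGTACGTCC"
k_max = 6

# Enumerate the reference once, at the largest k-mer size.
reference_kmers_max = kmers(reference, k_max)

for k in (4, 5, 6):
    print(k, truncated_containment(reference_kmers_max, query, k))
```

Note that truncation can miss the last few small k-mers at the end of a sequence, since they never appear as the prefix of a full-length k-mer. That is one reason the two methods are not guaranteed to agree exactly.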

Tasks:

This would be sufficient for a conference paper. More details can follow depending on interest.

For a journal publication, would need to:

dkoslicki commented 4 years ago

@ShaopengLiu1 #19 should be addressed now. I am not closing #19 or #2 until we have a better testing environment spun up, as all the tests I have run so far were local (far from optimal).

dkoslicki commented 4 years ago

@ShaopengLiu1 just a note: I added a class that will now compute the absolute ground truth containment indices. Recall that the last column of StreamingQueryDNADatabase.py is still an estimate of the containment index (just using un-truncated k-mers). The class to compute the ground truth is at /CMash/CMash/GroundTruth.py. You can see in this comment how the results from tests/script_tests/./run_small_tests.sh correspond quite nicely with the ground truth values.

If you would like to utilize this ground truth class, I strongly suggest you use my personal server (ping me if you forgot the IP address and login info) as it takes quite a bit of time and memory to brute-force calculate all the k-mers and their reverse complements.

To interact with the class, you can do something like:

import CMash.GroundTruth as G
training_database_file = "<snip>/TrainingDatabase.h5"
query_file = "<snip>/taxid_1192839_4_genomic.fna.gz"
g = G.TrueContainment(training_database_file=training_database_file, k_sizes="4-6-1")  # this step will take a long time if the k_sizes are realistically large 
df = g.return_containment_data_frame(query_file=query_file, location_of_thresh=-1, coverage_threshold=.1)

Note that the query_file need not be in TrainingDatabase.h5: its k-mers will still be enumerated even if it is not in the training database.
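For intuition about what a brute-force ground truth involves, here is a small self-contained sketch. It is not the code in GroundTruth.py, and the helper names are hypothetical. It assumes that each k-mer and its reverse complement are treated as the same k-mer (represented by the lexicographically smaller of the two), and that the containment index is the fraction of the reference's distinct k-mers that also occur in the query.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq, k):
    """Enumerate every k-mer, keeping one representative per strand pair."""
    result = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        result.add(min(kmer, reverse_complement(kmer)))
    return result


def true_containment(reference_seq, query_seq, k):
    """Exact containment of the reference in the query at k-mer size k."""
    reference_kmers = canonical_kmers(reference_seq, k)
    if not reference_kmers:
        return 0.0
    query_kmers = canonical_kmers(query_seq, k)
    return len(reference_kmers & query_kmers) / len(reference_kmers)


# Toy example: the query is the reverse complement of the reference,
# so every reference k-mer is found once strands are merged.
print(true_containment("AAAC", "GTTT", 4))
```

Because every k-mer of every sequence is held in memory, the cost grows with genome size and with the number of k-mer sizes requested, which is consistent with the time and memory warning above.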