Open dkoslicki opened 4 years ago
@ShaopengLiu1 #19 should be addressed now. I am not closing #19 or #2 until we have a better testing environment spun up, as all tests I have done are locally (very not optimal).
@ShaopengLiu1 just a note: I added a class that will now compute the absolute ground truth containment indicies. Recall that the last column of StreamingQueryDNADatabase.py
is still an estimate of the containment index (just using un-truncated k-mers). The class to compute the ground truth is at /CMash/CMash/GroundTruth.py
. You can see in this comment how the results by the tests/script_tests/./run_small_tests.sh
correspond quite nicely with the ground truth values.
If you would like to utilize this ground truth class, I strongly suggest you use my personal server (ping me if you forgot the IP address and login info) as it takes quite a bit of time and memory to brute-force calculate all the k-mers and their reverse complements.
To interact with the class, you can do something like:
import CMash.GroundTruth as G
training_database_file = "<snip>/TrainingDatabase.h5"
query_file = "<snip>/taxid_1192839_4_genomic.fna.gz"
g = G.TrueContainment(training_database_file=training_database_file, k_sizes="4-6-1") # this step will take a long time if the k_sizes are realistically large
df = g.return_containment_data_frame(query_file=query_file1, location_of_thresh=-1, coverage_threshold=.1)
Note that the query_file
need not be in the TrainingDatabase.h5
(as its k-mers will still be enumerated if it's not in the training database).
Definitions: "new method" = use a very large k-mer size, put in ternary search trie, use prefix matches to infer smaller k-mer size containment values "old method" = train and re-run CMash on each individual k-mer size.
Tasks:
StreamingQueryDNADatabase.py
with multiple k-mer size training database and compare to runningStreamingQueryDNADatabase.py
many times with training databases trained with a specific k-mer size).This would be sufficient for a conference paper. More details can follow depending on interest.
For a journal publication, would need to: