Multiple k-mer sizes confirmation and testing

dkoslicki commented 4 years ago

Definitions: "new method" = use a very large k-mer size, put in ternary search trie, use prefix matches to infer smaller k-mer size containment values "old method" = train and re-run CMash on each individual k-mer size.

Tasks:

[x] Address #19 so we no nothing funky is happening with the current implementation.
[x] Create testing environment to prepare for comparing old method to new method
[ ] Compare old to new method in estimating containment indexes between individual genomes (i.e. using this kind of command). Repeat over many different genomes to get an idea of the difference between old and new method
[ ] Compare old to new method in estimating presence/absence of training database organisms in many simulated metagenomes (i.e. run StreamingQueryDNADatabase.py with multiple k-mer size training database and compare to running StreamingQueryDNADatabase.py many times with training databases trained with a specific k-mer size).

This would be sufficient for a conference paper. More details can follow depending on interest.

For a journal publication, would need to:

[ ] Understand the theory behind the bias in the k-mer prefix truncation (@dkoslicki has already written it up)
[ ] Test the magnitude of the bias factor over many genomes (relatively straightforward task).

dkoslicki commented 4 years ago

@ShaopengLiu1 #19 should be addressed now. I am not closing #19 or #2 until we have a better testing environment spun up, as all tests I have done are locally (very not optimal).

dkoslicki commented 4 years ago

@ShaopengLiu1 just a note: I added a class that will now compute the absolute ground truth containment indicies. Recall that the last column of StreamingQueryDNADatabase.py is still an estimate of the containment index (just using un-truncated k-mers). The class to compute the ground truth is at /CMash/CMash/GroundTruth.py. You can see in this comment how the results by the tests/script_tests/./run_small_tests.sh correspond quite nicely with the ground truth values.

If you would like to utilize this ground truth class, I strongly suggest you use my personal server (ping me if you forgot the IP address and login info) as it takes quite a bit of time and memory to brute-force calculate all the k-mers and their reverse complements.

To interact with the class, you can do something like:

import CMash.GroundTruth as G
training_database_file = "<snip>/TrainingDatabase.h5"
query_file = "<snip>/taxid_1192839_4_genomic.fna.gz"
g = G.TrueContainment(training_database_file=training_database_file, k_sizes="4-6-1")  # this step will take a long time if the k_sizes are realistically large 
df = g.return_containment_data_frame(query_file=query_file1, location_of_thresh=-1, coverage_threshold=.1)

Note that the query_file need not be in the TrainingDatabase.h5 (as its k-mers will still be enumerated if it's not in the training database).

dkoslicki / CMash

Multiple k-mer sizes confirmation and testing #20