dkoslicki / CMash

Fast and accurate set similarity estimation via containment min hash
BSD 3-Clause "New" or "Revised" License

Testing environment #14

Open dkoslicki opened 4 years ago

dkoslicki commented 4 years ago

Current tests are end-to-end integration tests that make sure the scripts execute successfully. There is much more testing that could be done, including:

dkoslicki commented 4 years ago

Some unit tests are in. The MinHash module tests check the validity of results. The Query module tests only check for code-breaking errors at this point, since the module still has a lot of FIXMEs and TODOs.

Will need to:

Will be tagging this as help wanted and assigning everyone, since all are welcome to contribute.

SOP: create a new branch:

git checkout master
git pull origin master  # make sure code is up to date
git checkout -b <some_feature_branch_name>  # create a new branch implementing a new testing feature
# add your new feature
git commit -a  # commit your contributions
git push origin <some_feature_branch_name>  # push your changes to your feature branch
# then request a code review before merging to master
dkoslicki commented 4 years ago

Note: while I assigned everyone, this is mainly a QOL (quality of life) issue: things that will make future contributions easier, but that should not distract from the main projects. In other words, work on it as time permits.

dkoslicki commented 4 years ago

@dkoslicki Make sure GroundTruth.py identifies k-mers with their reverse complements (rc-k-mers) rather than counting them as distinct.
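For reference, a minimal sketch of what "identifying" a k-mer with its reverse complement usually means: map each k-mer to a canonical form (here, the lexicographically smaller of the k-mer and its reverse complement) before putting it in a set. This is an illustration only, not the code in GroundTruth.py; the helper names `reverse_complement`, `canonical` and `canonical_kmers` are made up for this example, and it assumes plain A/C/G/T input.

```python
# Illustrative sketch only: hypothetical helpers, not CMash's implementation.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    kmer = kmer.upper()
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq, k):
    """Collect the set of canonical k-mers of a sequence."""
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}


# A k-mer and its reverse complement collapse to the same set element,
# so a sequence and its reverse complement give identical k-mer sets.
forward = canonical_kmers("ACGGT", 4)
reverse = canonical_kmers(reverse_complement("ACGGT"), 4)
print(forward == reverse)
```

With this convention, the strand a genome happens to be stored on does not change its k-mer set.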

dkoslicki commented 4 years ago

./run_small_tests.sh
,k=10,k=12,k=14,k=16,k=18,k=20
taxid_1192839_4_genomic.fna.gz,1.0,1.0,1.0,1.0,1.0,1.0
taxid_28901_877_genomic.fna.gz,1.0,0.786,0.416,0.332,0.294,0.274

Ground truth (run on the server, since it takes a fair bit of memory):

import CMash.GroundTruth as G
query_file = "/data/dmk333/repos/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
training_file = "/data/dmk333/repos/CMash/tests/script_tests/TrainingDatabase.h5"
g = G.TrueContainment(training_file, "10-21-2")
df = g.return_containment_data_frame(query_file, -1, .1)
print(df)

                                    k=10      k=12      k=14      k=16      k=18      k=20
taxid_1192839_4_genomic.fna.gz  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
taxid_28901_877_genomic.fna.gz  0.970794  0.648166  0.404911  0.336364  0.303958  0.279067

Well that looks pretty nice to me!

dkoslicki commented 4 years ago

Switched to canonical k-mers to sanity check things; results basically unchanged.

Ground truth (on the server, since it takes a fair bit of memory):

                                    k=10     k=12      k=14      k=16      k=18      k=20
taxid_1192839_4_genomic.fna.gz  1.000000  1.00000  1.000000  1.000000  1.000000  1.000000
taxid_28901_877_genomic.fna.gz  0.970735  0.64816  0.404912  0.336364  0.303959  0.279068

So we'll be sticking with canonical k-mers for the ground truth as it's much more straightforward to understand.

dkoslicki commented 4 years ago

Note to self @dkoslicki: something odd is happening at small k-mer sizes: using run_comparison_to_ground_truth.sh via GroundTruth.py, in __return_containment_index:

return len(set1.intersection(set2)) / float(len(set1))

seems correct, but

return len(set1.intersection(set2)) / float(len(set2))

gives accurate results at small k-mer sizes, e.g.:

import CMash.GroundTruth as G
training_database_file = "/home/dkoslicki/Desktop/CMash/tests/script_tests/TrainingDatabase.h5"
query_file1 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
query_file2 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_562_8705_genomic.fna.gz"
g = G.TrueContainment(training_database_file, "4-6-1")
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file1][4]))
1.0
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file2][4]))
0.3056179775280899

And StreamingQueryDNADatabase.py is returning 1 (not 0.3056). Clearly, at k=4 the k-mer set of query_file2 contains all of query_file1's and is roughly three times its size, but then why are the results OK at larger k-mer sizes?
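To make the direction question concrete, here is a small self-contained sketch with toy sets. It is not the CMash code; `containment_index` is a stand-in for the set1-as-denominator formula quoted above, and the k-mers are made up. When one set is a subset of the other, the two directions give 1.0 and |set1|/|set2|, which is the same pattern as the 1.0 vs 0.3056 above.

```python
def containment_index(set1, set2):
    """|set1 & set2| / |set1|: the fraction of set1's k-mers that also occur in set2."""
    if not set1:
        raise ValueError("set1 must be non-empty")
    return len(set1 & set2) / len(set1)


# Toy data: `small` is a strict subset of `large`, which is three times its size.
small = {"AAAA", "AAAC", "AAAG"}
large = small | {"AAAT", "AACA", "AACC", "AACG", "AACT", "AAGA"}

print(containment_index(small, large))  # every k-mer of `small` is in `large`
print(containment_index(large, small))  # only a third of `large` is in `small`
```

Note that there are only 4^4 = 256 possible 4-mers (fewer once reverse complements are merged), so subset relationships like this are easy to hit at very small k.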

Oh yeah, and StreamingQueryDNADatabase.py uses a heck of a lot of memory for small k-mer sizes. Probably khmer or screed's fault, but that's TBD.

dkoslicki commented 4 years ago

Regarding the direction of containment, I think the committed way is best. With set1 as the denominator:

Total error per k-mer size:
k=8     0.043016
k=10    0.354925
k=12    2.485572
k=14    0.690597
k=16    0.161794
k=18    0.076439
k=20    0.035385
k=22    0.008816
dtype: float64

With set2 as the denominator:

Total error per k-mer size:
k=8     0.168598
k=10    2.173376
k=12    3.924027
k=14    0.832073
k=16    0.140583
k=18    0.047191
k=20    0.009990
k=22    0.018207
dtype: float64

But clearly something is up with k=12. Odd... This is using run_comparison_to_ground_truth.sh with:

testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="8-${maxK}-2"
numHashes=10000
containmentThresh=0
locationOfThresh=-1

Absolute error per genome, |true - CMash|:

genome                           k=8       k=10      k=12      k=14      k=16      k=18      k=20          k=22
taxid_1192839_4_genomic.fna.gz   0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000e+00  0.000000e+00
taxid_1307_414_genomic.fna.gz    0.000761  0.049504  0.257595  0.048064  0.003751  0.000108  2.923078e-07  3.920249e-05
taxid_1311_236_genomic.fna.gz    0.001250  0.050466  0.278542  0.050275  0.003800  0.000344  7.373714e-05  2.379639e-05
taxid_1759312_genomic.fna.gz     0.000639  0.034805  0.260666  0.058385  0.005820  0.000915  2.118953e-04  1.701001e-04
taxid_2026799_87_genomic.fna.gz  0.000761  0.045469  0.262687  0.055671  0.004839  0.000117  9.684609e-06  5.321341e-05
taxid_2041488_genomic.fna.gz     0.000067  0.024380  0.216208  0.039611  0.003973  0.000272  7.260736e-05  1.380767e-04
taxid_28901_877_genomic.fna.gz   0.000608  0.029265  0.257055  0.151288  0.081336  0.052041  2.663244e-02  6.756324e-03
taxid_554168_genomic.fna.gz      0.001172  0.043717  0.304057  0.059005  0.005468  0.000548  1.954086e-05  8.476430e-07
taxid_562_8705_genomic.fna.gz    0.027607  0.039054  0.315867  0.110230  0.026908  0.012463  4.603500e-03  1.080697e-03
taxid_573_36_genomic.fna.gz      0.010152  0.038264  0.332896  0.118068  0.025898  0.009632  3.761221e-03  5.540108e-04
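The "total error per k-mer size" figures appear to be the column sums of this |true - CMash| table (the k=8 column, for example, sums to roughly 0.0430). A minimal pure-Python sketch of that computation follows. The function name `total_abs_error` is made up for illustration and this is not the code behind run_comparison_to_ground_truth.sh; the sample numbers are the k=10 and k=12 values from the run_small_tests.sh and ground truth outputs earlier in this thread, used only as toy input.

```python
def total_abs_error(true_vals, est_vals):
    """Sum |true - estimate| over all genomes, separately for each k-mer size."""
    totals = {}
    for genome, true_by_k in true_vals.items():
        for k, true_val in true_by_k.items():
            totals[k] = totals.get(k, 0.0) + abs(true_val - est_vals[genome][k])
    return totals


# Toy input: ground truth vs CMash estimates, keyed by genome and then by k-mer size.
ground_truth = {
    "taxid_1192839_4_genomic.fna.gz": {"k=10": 1.0, "k=12": 1.0},
    "taxid_28901_877_genomic.fna.gz": {"k=10": 0.970794, "k=12": 0.648166},
}
cmash = {
    "taxid_1192839_4_genomic.fna.gz": {"k=10": 1.0, "k=12": 1.0},
    "taxid_28901_877_genomic.fna.gz": {"k=10": 1.0, "k=12": 0.786},
}

totals = total_abs_error(ground_truth, cmash)
for k, err in totals.items():
    print(k, round(err, 6))
```

Keeping the per-genome table alongside the totals makes it easier to see whether a bad k-mer size (like k=12 here) is off for every genome or only a few.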

[Image: Screenshot 2020-03-27 18 06 28]

Now to test on a "real" metagenome...

dkoslicki commented 4 years ago

And note, the problem appears to only be at k=12: with

testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="14-${maxK}-1"
numHashes=10000
containmentThresh=0
locationOfThresh=-1

we get: [Image: Screenshot 2020-03-27 18 10 24]

dkoslicki commented 4 years ago

Will create a new issue for the ground truth containment computation so that progress on it is easier to track.