Open dkoslicki opened 4 years ago
Some unit tests are in. The MinHash module tests check validity of results. The Query module tests only really check for code-breaking errors at this point, as there are a lot of FIXMEs and TODOs.
Will need to:
Will be tagging this as help wanted and assigning everyone, since all are welcome to contribute.
SOP: create new branch:
git checkout master
git pull origin master # make sure code is up to date
git checkout -b <some_feature_branch_name> # create a new branch implementing a new testing feature
# add your new feature
git commit -a # commit your contributions
git push origin <some_feature_branch_name> # push your changes to your feature branch
# then request a code review before merging to master
Note: while I assigned all, this is mainly a QOL (quality of life) issue: things that will make our future contributions easier, but that should not distract from main projects, i.e. as time permits.
@dkoslicki Make sure GroundTruth.py is identifying k-mers with their reverse complements (rc-k-mers), not counting them as distinct.
./run_small_tests.sh
,k=10,k=12,k=14,k=16,k=18,k=20
taxid_1192839_4_genomic.fna.gz,1.0,1.0,1.0,1.0,1.0,1.0
taxid_28901_877_genomic.fna.gz,1.0,0.786,0.416,0.332,0.294,0.274
Ground truth (run on the server, since it takes a fair bit of memory):
import CMash.GroundTruth as G
query_file="/data/dmk333/repos/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
training_file="/data/dmk333/repos/CMash/tests/script_tests/TrainingDatabase.h5"
g = G.TrueContainment(training_file, "10-21-2")
df = g.return_containment_data_frame(query_file, -1, .1)
print(df)
                                    k=10      k=12      k=14      k=16      k=18      k=20
taxid_1192839_4_genomic.fna.gz  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000
taxid_28901_877_genomic.fna.gz  0.970794  0.648166  0.404911  0.336364  0.303958  0.279067
Well that looks pretty nice to me!
Switched to canonical k-mers to sanity check things, results basically unchanged:
Ground truth (run on the server, since it takes a fair bit of memory):
                                    k=10     k=12      k=14      k=16      k=18      k=20
taxid_1192839_4_genomic.fna.gz  1.000000  1.00000  1.000000  1.000000  1.000000  1.000000
taxid_28901_877_genomic.fna.gz  0.970735  0.64816  0.404912  0.336364  0.303959  0.279068
So we'll be sticking with canonical k-mers for the ground truth as it's much more straightforward to understand.
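For anyone writing tests around this, here is a minimal sketch of what "canonical k-mers" means. It assumes the common convention of representing each k-mer by the lexicographically smaller of itself and its reverse complement; whether GroundTruth.py uses exactly this convention is not stated in this thread. The names reverse_complement and canonical_kmers are hypothetical helpers, not CMash functions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def canonical_kmers(seq, k):
    """Set of canonical k-mers of seq.

    Assumed convention: a k-mer and its reverse complement are both
    represented by whichever of the two sorts first lexicographically.
    """
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


seq = "GATTACA"
# A sequence and its reverse complement give the same canonical k-mer set,
# so k-mers and rc-k-mers are not counted as distinct.
print(canonical_kmers(seq, 3) == canonical_kmers(reverse_complement(seq), 3))  # True
```

A property like the last line (same set for a sequence and its reverse complement) is a cheap check to include in a unit test.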
Note to self @dkoslicki: something odd is happening at small k-mer sizes. Using run_comparison_to_ground_truth.sh via GroundTruth.py, in __return_containment_index:
return len(set1.intersection(set2)) / float(len(set1))
seems correct, but
return len(set1.intersection(set2)) / float(len(set2))
returns accurate results at small k-mer sizes, e.g.:
import CMash.GroundTruth as G
training_database_file = "/home/dkoslicki/Desktop/CMash/tests/script_tests/TrainingDatabase.h5"
query_file1 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_1192839_4_genomic.fna.gz"
query_file2 = "/home/dkoslicki/Desktop/CMash/tests/Organisms/taxid_562_8705_genomic.fna.gz"
g = G.TrueContainment(training_database_file, "4-6-1")
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file1][4]))
1.0
len(g.training_file_to_ksize_to_kmers[query_file1][4].intersection(g.training_file_to_ksize_to_kmers[query_file2][4]))/float(len(g.training_file_to_ksize_to_kmers[query_file2][4]))
0.3056179775280899
And StreamingQueryDNADatabase.py is returning a 1 (not the 0.3056).
Clearly, at k=4 query_file2 contains all of query_file1's k-mers and has roughly three times as many (ratio 0.3056), but why are the results OK at higher k-mer sizes?
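The asymmetry between the two denominators can be reproduced with toy sets. This is a minimal sketch, assuming set1 is the denominator as in the committed __return_containment_index; the function containment_index and the toy k-mer sets are hypothetical stand-ins, not CMash code.

```python
def containment_index(set1, set2):
    """Fraction of set1's elements that also occur in set2: |set1 ∩ set2| / |set1|."""
    if not set1:
        raise ValueError("set1 must be non-empty")
    return len(set1 & set2) / len(set1)


# Hypothetical toy k-mer sets: `small` is a subset of `large`,
# and `large` has three times as many elements.
small = {"AAAA", "AAAC", "AAAG"}
large = small | {"AAAT", "AACA", "AACC", "AACG", "AACT", "AAGA"}

print(containment_index(small, large))  # 1.0: every element of small is in large
print(containment_index(large, small))  # 3/9, roughly 0.33
```

Swapping the arguments is the same as swapping the denominator, which is why one direction gives 1.0 and the other gives the size ratio whenever one set is fully contained in the other.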
Oh yeah, and StreamingQueryDNADatabase.py uses a heck of a lot of memory for small k-mer sizes. Probably khmer or screed's fault, but that's TBD.
Regarding the direction of containment, I think the committed way (set1 as the denominator) is best:
Total error per k-mer size:
k=8 0.043016
k=10 0.354925
k=12 2.485572
k=14 0.690597
k=16 0.161794
k=18 0.076439
k=20 0.035385
k=22 0.008816
dtype: float64
set2 as denom:
Total error per k-mer size:
k=8 0.168598
k=10 2.173376
k=12 3.924027
k=14 0.832073
k=16 0.140583
k=18 0.047191
k=20 0.009990
k=22 0.018207
dtype: float64
But clearly something is up with k=12. Odd...
This is using run_comparison_to_ground_truth.sh with:
testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="8-${maxK}-2"
numHashes=10000
containmentThresh=0
locationOfThresh=-1
|true-CMash|:
genome | k=8 | k=10 | k=12 | k=14 | k=16 | k=18 | k=20 | k=22 |
---|---|---|---|---|---|---|---|---|
taxid_1192839_4_genomic.fna.gz | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000e+00 |
taxid_1307_414_genomic.fna.gz | 0.000761 | 0.049504 | 0.257595 | 0.048064 | 0.003751 | 0.000108 | 2.923078e-07 | 3.920249e-05 |
taxid_1311_236_genomic.fna.gz | 0.001250 | 0.050466 | 0.278542 | 0.050275 | 0.003800 | 0.000344 | 7.373714e-05 | 2.379639e-05 |
taxid_1759312_genomic.fna.gz | 0.000639 | 0.034805 | 0.260666 | 0.058385 | 0.005820 | 0.000915 | 2.118953e-04 | 1.701001e-04 |
taxid_2026799_87_genomic.fna.gz | 0.000761 | 0.045469 | 0.262687 | 0.055671 | 0.004839 | 0.000117 | 9.684609e-06 | 5.321341e-05 |
taxid_2041488_genomic.fna.gz | 0.000067 | 0.024380 | 0.216208 | 0.039611 | 0.003973 | 0.000272 | 7.260736e-05 | 1.380767e-04 |
taxid_28901_877_genomic.fna.gz | 0.000608 | 0.029265 | 0.257055 | 0.151288 | 0.081336 | 0.052041 | 2.663244e-02 | 6.756324e-03 |
taxid_554168_genomic.fna.gz | 0.001172 | 0.043717 | 0.304057 | 0.059005 | 0.005468 | 0.000548 | 1.954086e-05 | 8.476430e-07 |
taxid_562_8705_genomic.fna.gz | 0.027607 | 0.039054 | 0.315867 | 0.110230 | 0.026908 | 0.012463 | 4.603500e-03 | 1.080697e-03 |
taxid_573_36_genomic.fna.gz | 0.010152 | 0.038264 | 0.332896 | 0.118068 | 0.025898 | 0.009632 | 3.761221e-03 | 5.540108e-04 |
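For reference, the "Total error per k-mer size" figures appear to be the column sums of this |true-CMash| table. Here is a minimal sketch of that computation, assuming both sets of results are available as nested dicts keyed by genome and k-mer size. The function name total_error_per_ksize and the toy inputs are hypothetical; only the k=12 column is copied from the table above.

```python
def total_error_per_ksize(true_vals, est_vals):
    """Sum |true - estimate| over genomes, separately for each k-mer size.

    Both arguments map genome name -> {k-mer size -> containment index}.
    """
    totals = {}
    for genome, per_k in true_vals.items():
        for k, true_c in per_k.items():
            totals[k] = totals.get(k, 0.0) + abs(true_c - est_vals[genome][k])
    return totals


# Hypothetical numbers, for illustration only (not taken from the runs above).
true_vals = {"genome_a": {10: 1.0, 12: 1.0}, "genome_b": {10: 0.97, 12: 0.65}}
est_vals = {"genome_a": {10: 1.0, 12: 1.0}, "genome_b": {10: 0.95, 12: 0.80}}
print(total_error_per_ksize(true_vals, est_vals))

# The k=12 column of the |true-CMash| table above, top to bottom.
k12_column = [0.000000, 0.257595, 0.278542, 0.260666, 0.262687,
              0.216208, 0.257055, 0.304057, 0.315867, 0.332896]
print(round(sum(k12_column), 6))  # 2.485573
```

The column sum agrees with the 2.485572 reported for k=12 with set1 as the denominator, up to rounding of the displayed values.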
Now to test on a "real" metagenome...
And note, the problem appears to only be at k=12: with
testOrganism="../Organisms/taxid_1192839_4_genomic.fna.gz"
maxK=22
kSizes="14-${maxK}-1"
numHashes=10000
containmentThresh=0
locationOfThresh=-1
we get
Will create new issue for ground truth containment computation so it will be easier to track progress on this.
Current tests are end-to-end integration tests that make sure the scripts execute successfully. There is much more testing that could be done, including:
tests folder (lots can be copied from CMash/MinHash.py)