dkoslicki / CMash

Fast and accurate set similarity estimation via containment min hash
BSD 3-Clause "New" or "Revised" License

Improved classification time with KMC #15

Open dkoslicki opened 4 years ago

dkoslicki commented 4 years ago

When running `StreamingQueryDNADatabase.py`, we really only need the k-mers in the sample that also appear in the training database sketches. As such, it's possible to:

  1. Dump all the training database sketch k-mers using KMC (like it's done here)
  2. Use KMC to count the k-mers in the sample
  3. Intersect these with the k-mers in the training database sketches and dump these to a file
  4. Reformat these dumped k-mers into a FASTA-looking file
  5. Feed that into StreamingQueryDNADatabase.py

Steps 2-5 are basically what's done here; I described this approach to Nathan LaPierre, but never got around to implementing it in CMash.
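A minimal sketch of that pipeline using the KMC CLI (file names and the k-mer size are hypothetical, and it assumes the training sketch k-mers have already been dumped to a `TrainingDatabase_dump.fa`):

```bash
# 1. Build a KMC database from the dumped training sketch k-mers
#    (-fm since the dump is a multi-FASTA with one k-mer per record)
mkdir -p tmp
kmc -k21 -fm -ci1 TrainingDatabase_dump.fa training_db tmp/

# 2. Count the k-mers in the sample
kmc -k21 -fq -ci1 sample.fq sample_db tmp/

# 3. Intersect the sample k-mers with the training sketch k-mers
kmc_tools simple sample_db training_db intersect intersection_db

# 4. Dump the shared k-mers and reformat them into a FASTA-looking file
kmc_dump intersection_db intersection.txt
awk '{print ">kmer_" NR; print $1}' intersection.txt > intersection.fa

# 5. Feed that file into the streaming query script
python StreamingQueryDNADatabase.py intersection.fa TrainingDatabase.h5 results.csv 21-21-1
```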

dkoslicki commented 4 years ago

Could be nice to write this as an optional script that runs before the StreamingQueryDNADatabase.py script, and then to compare runtime performance with/without this pre-processing (as you’ll also not need the Bloom filter pre-filter); a sketch of such a comparison is below. You’ll lose the ability to keep track of k-mer counts (which is the focus of #21), but it would be for a different use case (e.g. Metalign). @IsaacT1123 This might be a good first thing to cut your teeth on, and could be a component of the eventual CMash publication mentioned in #20.
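One way such a timing comparison could look (`kmc_prefilter.py` is a hypothetical name for the proposed script; `/usr/bin/time -v` reports wall time and peak memory on Linux):

```bash
# Baseline: query the raw reads directly (current Bloom-filter path)
/usr/bin/time -v python StreamingQueryDNADatabase.py sample.fq TrainingDatabase.h5 baseline.csv 21-21-1

# Proposed: KMC pre-processing first, then query the filtered k-mers
/usr/bin/time -v python kmc_prefilter.py sample.fq TrainingDatabase.h5 filtered.fa  # hypothetical script
/usr/bin/time -v python StreamingQueryDNADatabase.py filtered.fa TrainingDatabase.h5 filtered.csv 21-21-1
```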

IsaacT1123 commented 4 years ago

I’d be happy to work on this enhancement after getting CMash installed and creating a retraining script.

dkoslicki commented 4 years ago

Excellent! Go ahead and make a new branch (named something like KMC or Issue15 or the like), and I'll assign this issue to you

dkoslicki commented 4 years ago

@IsaacT1123 a friendly reminder to reference this issue in your commits! It's a great way to keep track of changes being made, and is also a visible history of your contributions to FOSS projects!

IsaacT1123 commented 4 years ago

I’ll be sure to do so from now on; thank you for reminding me.

dkoslicki commented 4 years ago

Still in the process of diagnosing the issue. Will need to return to it later @dkoslicki

dkoslicki commented 4 years ago

Definitely an issue with the `kmc_dump` and then the intersection: with

```bash
/usr/bin/time python ${scriptsDir}/MakeStreamingDNADatabase.py filenames.txt TrainingDatabase.h5 -k 10
/usr/bin/time python ${scriptsDir}/StreamingQueryDNADatabase.py ${testOrganism} TrainingDatabase.h5 results.csv 10-10-1 --sensitive --intersect -c 0
```

and then checking:

```bash
grep -c '>' 10mers_intersection_dump.fa
```

you get 3 10-mers, which is clearly wrong, since the testOrganism itself has far more than 3 distinct 10-mers.

dkoslicki commented 4 years ago

Indeed, it might be the `kmc_tools simple` intersect that's the problem, since:

```bash
comm -1 -2 <(kmc_dump reads_10mers_dump /dev/fd/1 | cut -f1 | sort) <(grep -v '>' TrainingDatabase_dump.fa | sort) | wc -l
```

gives 689. I.e. if you dump the 10-mers from the reads, dump the training database 10-mers, and count the ones in common, you get much more than 3...
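For reference, the KMC path that should reproduce that count looks roughly like this (the database names are hypothetical stand-ins for the ones built above):

```bash
# Intersect the reads' 10-mers with the training database's 10-mers
kmc_tools simple reads_10mers_db training_10mers_db intersect 10mers_intersection
kmc_dump 10mers_intersection 10mers_intersection.txt
wc -l < 10mers_intersection.txt   # comm finds ~689 shared 10-mers; this path yields only 3
```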

dkoslicki commented 4 years ago

@IsaacT1123 So the `-fa` to `-fm` change in `Intersect.count_training_kmers()` fixed one issue: tests now pass if you use a single k-mer size, but still fail if you use multiple k-mer sizes...

dkoslicki commented 4 years ago

@IsaacT1123 Finally figured out the issue, and it's "obvious" now that I see what happens: when intersecting the reads' 21-mers with the training database 21-mers, you miss some k-mers with k<21 that are in the reads, namely those that aren't prefixes of some database 21-mer. E.g. the query file has the 11-mer "TGCCCTGTGGC" in it, but since this isn't a prefix of a 21-mer in the training database, it isn't a prefix of any 21-mer included in the 21mers_intersection_dump.fa file. Hence these smaller k-mer sizes (k<21) are undercounted. I.e. the KMC prefilter, at this time, should only be used when the training database is constructed with a k-mer size of K and StreamingQueryDNADatabase.py is called with the k-range `K-K-1`. This explains why the above comment noted that things work for a single k-mer size.

I will need to think about whether the KMC approach will work for multiple k-mer sizes. Thankfully, applications like Metalign (which only use a single k-mer size) are unaffected by this issue.
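A toy shell illustration of the failure mode (the sequences are made up): the 11-mer below occurs in the "reads", but it is not a prefix of the lone 21-mer that survives the intersection, so it never makes it into the dump used for the k=11 counts:

```bash
# The only 21-mer kept by the KMC prefilter (made-up sequence)
kept_21mer="GGGGGGGGGGGGGGGGGGGGG"
# An 11-mer that occurs in the reads
read_11mer="TGCCCTGTGGC"

# Smaller k-mers are derived as prefixes of the K-mers that are kept,
# so the 11-mer survives only if it prefixes a kept 21-mer
case "$kept_21mer" in
  "$read_11mer"*) echo "retained: counted at k=11" ;;
  *)              echo "lost: undercounted at k=11" ;;
esac
```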

luizirber commented 4 years ago

> I.e. the KMC prefilter, at this time, should only be used when the training database is constructed with a k-mer size of K and StreamingQueryDNADatabase.py is called with the k-range `K-K-1`.

I was using both `21-51-10` and `K-K-1` in my tests and was seeing ~40% differences; I will stick with `K-K-1` for the near future.

(Even so, if you look at the table in this section, there are some datasets with big differences from the ground truth...)

dkoslicki commented 4 years ago

@luizirber

>> I.e. the KMC prefilter, at this time, should only be used when the training database is constructed with a k-mer size of K and StreamingQueryDNADatabase.py is called with the k-range `K-K-1`.
>
> I was using both `21-51-10` and `K-K-1` in my tests and was seeing ~40% differences; I will stick with `K-K-1` for the near future.
>
> (Even so, if you look at the table in this section, there are some datasets with big differences from the ground truth...)

  1. Please do note that it is indeed _expected_ that when using `StreamingQueryDNADatabase.py <snip> bot-top-diff`, the containment index estimates for k-mer sizes `bot <= k < top` will indeed be off (sometimes significantly so), while k-mer size `k=top` will be at least as accurate as the [manuscript](https://doi.org/10.1016/j.amc.2019.02.018) bloom-filter approach.

Recall that last year when I presented this approach, I mentioned how taking the prefix of a k-mer is not a truly random sample, but a biased one. The theory for how much this bias affects the containment index has been worked out, but unfortunately the theory says that the bias factor is data dependent. Hence @ShaopengLiu1 is working on #20 to test this on realistic data sets. He too is seeing differences of roughly 10-30%, but it completely depends on what the actual containment index is (e.g. a very small or very high true containment index means k-mer sizes with `bot <= k < top` are better estimates). Hence also the proviso in the (very out of date) Readme.md, and hence also why this streaming, ternary search tree, multi-k-mer-size approach has not been published yet (only the non-streaming, bloom-filter approach has). Details are still being worked out for the intermediate k-mer sizes `bot <= k < top`.

  2. Since Sourmash is estimating a single k-mer size containment index, if you want an "apples-to-apples" comparison, you should indeed be using `K-K-1` only when using `StreamingQueryDNADatabase.py`.

  3. Also, keep in mind that while `Sourmash` is using scaled hashes (iirc), `StreamingQueryDNADatabase.py` results will _only_ be as accurate as the number of hashes you chose via `-n` in `MakeStreamingDNADatabase.py`. The default value of `n=500` is quite small, and I would recommend _always_ using as large of an `n` as you can tolerate (resource-wise). E.g. `Metalign` uses `-n 1000` or `-n 2000`. As indicated in [equation (2.7)](https://www.biorxiv.org/content/10.1101/184150v1.full.pdf) (which is a lower bound on the streaming, multi-k-mer-size approach when `k-mer size = top`), the containment index accuracy exponentially improves with increasing `n` (note that the equation uses `k` instead of `n`, but it's still up there in the exponent).
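For intuition only (this is just the generic Chernoff-style shape of such sketching bounds, not the exact statement or constants of equation (2.7)): an estimate of a containment index C built from n sketch hashes typically concentrates like

```latex
% Illustrative Chernoff-style concentration for a sketch of n hashes;
% see equation (2.7) of the linked preprint for the actual bound.
\Pr\left( \left| \hat{C}_n - C \right| \ge \delta \right) \le 2 \exp\left( -c \, n \, \delta^2 \right)
```

so the failure probability decays exponentially as `n` grows, which is why bumping `-n` from 500 to 1000 or 2000 helps so much.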

luizirber commented 4 years ago
> 1. Please do note that it is indeed _expected_ that when using `StreamingQueryDNADatabase.py <snip> bot-top-diff`, the containment index estimates for k-mer sizes `bot <= k < top` will indeed be off (sometimes significantly so), while k-mer size `k=top` will be at least as accurate as the [manuscript](https://doi.org/10.1016/j.amc.2019.02.018) bloom-filter approach.

I'll also run for "cmash paper" (since I only have analysis with the new streaming cmash), and see how they compare.

> 3. Also, keep in mind that while `Sourmash` is using scaled hashes (iirc), `StreamingQueryDNADatabase.py` results will _only_ be as accurate as the number of hashes you chose via `-n` in `MakeStreamingDNADatabase.py`. The default value of `n=500` is quite small, and I would recommend _always_ using as large of an `n` as you can tolerate (resource-wise). E.g. `Metalign` uses `-n 1000` or `-n 2000`. As indicated in [equation (2.7)](https://www.biorxiv.org/content/10.1101/184150v1.full.pdf) (which is a lower bound on the streaming, multi-k-mer-size approach when `k-mer size = top`), the containment index accuracy exponentially improves with increasing `n` (note that the equation uses `k` instead of `n`, but it's still up there in the exponent).

I did use `-n 1000` and `-n 100000` (to match the "high-accuracy" setting in this section of the mash screen paper). Interestingly, when the true containment is very small (0.01 to 0.2) CMash is calculating it correctly, but when it's close to 1 it is underestimating quite a bit (worst case is 0.51, when it should be 0.995).

luizirber commented 4 years ago

> I'll also run for "cmash paper" (since I only have analysis with the new streaming cmash), and see how they compare.

Done, and...

> I did use `-n 1000` and `-n 100000` (to match the "high-accuracy" setting in this section of the mash screen paper). Interestingly, when the true containment is very small (0.01 to 0.2) CMash is calculating it correctly, but when it's close to 1 it is underestimating quite a bit (worst case is 0.51, when it should be 0.995).

For "cmash paper" results are <1% from ground truth (when n=100000), and <5% for n=1000. I'm going to use these numbers for my comparisons.

dkoslicki commented 4 years ago

@luizirber I would be interested to see when non-“CMash paper” is underestimating badly (as in, the worst case of 0.51 when it should be 0.995). Maybe you could send me the training/testing data? Might be a usage issue (due to the constant flux of the streaming version of CMash), or could be some other bug or issue you’ve identified (since at least theoretically, high containment index is when this approach should work best).

luizirber commented 4 years ago

> I would be interested to see when non-“CMash paper” is underestimating badly (as in, the worst case of 0.51 when it should be 0.995). Maybe you could send me the training/testing data?

This is all coming from my thesis repo; here's a table with the containments: https://nbviewer.jupyter.org/github/luizirber/phd/blob/491bc7b/experiments/smol_gather/notebooks/analysis.ipynb#CMash-with-TST-(new-version). To download the data and generate the results you can do:

```bash
git clone https://github.com/luizirber/phd && cd phd
conda env create --force --file environment.yml
conda activate thesis
cd experiments/smol_gather && snakemake --use-conda
```

That last command might take some time to run, so if you want only the cmash results for k=(21,31) and n=(1000,1000000) you can do

```bash
snakemake --use-conda outputs/cmash/SRR606249.csv outputs/cmash/SRR606249-k{21,31}-n{1000,1000000}.csv outputs/cmash_paper/SRR606249-k{21,31}-n{1000,1000000}.csv
```

or, if you want to see all the commands that would be executed without snakemake running them, you can also add the -np flags (-n for dry run, -p for printing commands).

(I was also running k=51 for the new CMash, but the paper version fails during NodeGraph construction, which is a bit embarrassing for me since it's using khmer :disappointed:)