KoslickiLab / DiversityOptimization

Minimizing biological diversity to improve microbial taxonomic reconstruction
MIT License

Benchmark 16S vs QIIME #9

Open dkoslicki opened 4 years ago

dkoslicki commented 4 years ago

Goal: compare the performance (in terms of accuracy, speed, and RAM usage) of the MinDivLP approach, and a standard QIIME2 workflow.

If you have problems getting QIIME2 to work, Galaxy may be your friend, as it already has dada2 and mothur installed (so you wouldn't need to worry about tool installation issues) and some pre-built workflows already exist.

If time constraints become a problem, focus primarily on the Naive Bayes classifier in QIIME2 (as indicated by this publication).

cmcolbert commented 4 years ago

I have been working on getting the code from tax-credit up and running on my QIIME Virtual Machine, and I have encountered a few problems that I am working on fixing.

Here are the results from a small early test:

[Plots: mock community Taxon Accuracy Rate and Taxon Detection Rate line plots, k = 4]

cmcolbert commented 4 years ago

After resolving the memory problems by creating a Python implementation of nonnegative least squares that accepts sparse matrices, sparse_nnls, based on the Julia implementation, and resolving the divide-by-zero problem in the creation of f in MinDivLP by uniformly adding a small number epsilon to np.power(B.T @ y_large, 1 - q) (as seen in this commit), here are the results from using small_k = 8 and small_k = 10 with large_k = 12, q = 0.01, and lambda = 10000, with MinDivLP in bright green:

[Plots: Taxon Accuracy Rate and Taxon Detection Rate line plots]

This was done with mock communities 1, 2, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, and 23 from mockrobiota, using the framework from tax-credit. These are all of the communities in tax-credit that use gg_13_8_otus and have only one sample.
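For reference, here is a minimal sketch of the two fixes described above. The names B, y_large, q, and epsilon follow the thread, but the exact form of f and the use of scipy's lsq_linear are assumptions for illustration; the repo's sparse_nnls may differ.

```python
import numpy as np
from scipy.optimize import lsq_linear


def mindivlp_weights(B, y_large, q=0.01, epsilon=1e-8):
    # One plausible form of the weight vector f: the thread describes dividing
    # by np.power(B.T @ y_large, 1 - q), so epsilon is added uniformly to keep
    # zero entries from causing a divide-by-zero.
    return 1.0 / (np.power(B.T @ y_large, 1 - q) + epsilon)


def sparse_nnls_sketch(A, y):
    # Stand-in for a sparse-aware nonnegative least-squares solve:
    # lsq_linear accepts scipy.sparse matrices and nonnegativity bounds,
    # unlike scipy.optimize.nnls, which requires a dense matrix.
    return lsq_linear(A, y, bounds=(0, np.inf)).x
```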

Here is how the tax-credit paper defines Taxon Accuracy Rate (TAR) and Taxon Detection Rate (TDR); a small illustrative sketch of the computation follows the definitions:

At a given taxonomic level, a classification is a

  • True positive (TP), if that taxon is both observed and expected.
  • False positive (FP), if that taxon is observed but not expected.
  • False negative (FN), if a taxon is expected but not observed.

These are used to calculate TAR and TDR as

  • TAR = TP/(TP + FP) or the fraction of observed taxa that were expected at level L.
  • TDR = TP/(TP + FN) or the fraction of expected taxa that are observed at level L.
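A minimal sketch of that computation on sets of taxa at a fixed level (the function and argument names are illustrative, not tax-credit's own code):

```python
def tar_tdr(observed, expected):
    """Taxon Accuracy Rate and Taxon Detection Rate at one taxonomic level.

    observed, expected: collections of taxon labels at the same level.
    """
    observed, expected = set(observed), set(expected)
    tp = len(observed & expected)   # observed and expected
    fp = len(observed - expected)   # observed but not expected
    fn = len(expected - observed)   # expected but not observed
    tar = tp / (tp + fp) if (tp + fp) else 0.0
    tdr = tp / (tp + fn) if (tp + fn) else 0.0
    return tar, tdr
```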

The low taxon accuracy rate is possibly an artifact of including all nonzero values in the taxonomic profiles generated by MinDivLP without a threshold, which can cause many false positives. This can be checked once the L1 error function is in place in the framework: a low L1 error would mean the spurious taxa carry little abundance.

I ran the code to create the sensing matrix for the 99% OTUs at k = 14 overnight, so I will rerun the same tests with large_k = 14.

dkoslicki commented 4 years ago

The low taxon accuracy rate is possibly an artifact of including all nonzero values in the taxonomic profiles generated by MinDivLP without a threshold, which can cause many false positives. This can be checked once the L1 error function is in place in the framework: a low L1 error would mean the spurious taxa carry little abundance.

@cmcolbert Yes, it's actually typical in metagenomics studies to use a threshold: set some small value (say threshold = 0.001) and set any values below this to 0. Some people re-normalize after this, some don't. But you might consider also including the "thresholded" version in your analysis and plots.
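A minimal sketch of that thresholding step, assuming x is a relative-abundance vector (the threshold value and the renormalize option are illustrative choices, not a fixed convention):

```python
import numpy as np


def threshold_profile(x, threshold=0.001, renormalize=True):
    # Zero out relative abundances below the threshold; optionally renormalize
    # so the remaining abundances sum to 1 (some studies skip this step).
    x = np.asarray(x, dtype=float).copy()
    x[x < threshold] = 0.0
    total = x.sum()
    if renormalize and total > 0:
        x /= total
    return x
```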

Also, do you have any information about speed + RAM for our approach and the others?

cmcolbert commented 4 years ago

@dkoslicki I will make that change. As for speed and RAM: I am currently computing runtimes in the tax-credit runtime section, and the RAM used with 99_otus.fasta for up to small_k = 12 and large_k = 14 has always been less than 32 gigabytes, though it has varied throughout the computation. I will take better note of memory usage in the run after computing the runtimes.

Also, I ran some small tests to determine a good lambda value (only with mock community 1, large_k = 14, small_k = 8, and q = 0.01), and the taxon detection rate is not greatly affected by lambda in the case tested. However, the taxon accuracy rate is much higher for lambda = 1000, the smallest value tested and the only value that shows a decrease in taxon detection rate. This makes sense since the smaller lambda value induces a smaller support size, so I will run this test again with a threshold to see if the difference fades.

Here are the results from the no-threshold test for determining a good lambda value:

[Plots: TAR and TDR for the lambda sweep] (note: the legend format is mindivlp lambda large_k q small_k)

cmcolbert commented 4 years ago

Here are graphs comparing the results of MinDivLP with and without a threshold of 0.001:

[Plots: Taxon Accuracy Rate and Taxon Detection Rate, thresholded vs. non-thresholded]

Without a threshold is in bright green, and with a threshold is in burnt orange. These are both with small_k = 8, large_k = 12, q = 0.01, and lambda = 10000.

cmcolbert commented 4 years ago

Here are the results from the runtime analysis from tax-credit:

The runtime for MinDivLP with both the reference database and the query database at size 10000 is not included because it was taking far too long. Also, the test with varying query database sizes violates MinDivLP's assumption of low biological diversity, so it makes sense that it performed poorly there.

dkoslicki commented 4 years ago

That's quite unfortunate: slower, and less accurate. Maybe things will improve when we apply the approach to WGS data a la #5, #6, #8, and #10? @cmcolbert let's set up a meeting with us and Simon to discuss these results. The only other way forward I see for 16S is using alternate metrics (e.g. UniFrac) in case we are not predicting the exact organisms, but ones that are "close" taxonomically.

cmcolbert commented 4 years ago

@dkoslicki That sounds good! I will send an email.

Also, do you think a larger large_k value could make up the difference in accuracy? It would take a decent amount of time to create the matrices, but their sizes aren't out of hand. For 99_otus.fasta, it appears that the k=14 sensing matrix takes up about 3GB of memory (from checking memory usage for a Python console with only the matrix loaded). Although I suppose it would take longer to form y_large.

cmcolbert commented 4 years ago

Here are plots comparing the results of MinDivLP using large_k = 12 (blue) vs large_k = 14 (red):

[Plots: Taxon Accuracy Rate and Taxon Detection Rate]

(Using const = 10000, q = 0.01, small_k = 8)

dkoslicki commented 4 years ago

@cmcolbert that's unfortunate... Care to compare the thresholded version of large_k=14 and the non-thresholded versions against the other methods to see how far off we are?

Lastly, any hope of trying (over a night or two) large_k = 16? Given that the detection rate didn't drop but the accuracy went up, this does indicate that we aren't using too large of a large_k yet. It just depends on whether your machine can handle that size of large_k.

I have a sneaking suspicion that we may need to wrap this up and try WGS instead (which should actually be more straightforward after some initial learning-curve adjustments to the different data types and k-mer representations, given the work you've done already on the underlying algorithms).

cmcolbert commented 4 years ago

@dkoslicki absolutely! Here are the thresholded (burnt orange) and non-thresholded (bright green) versions of large_k = 14 against the other methods:

[Plots: Taxon Accuracy Rate and Taxon Detection Rate] (again with small_k = 8, q = 0.01, and const = 10000)

Re large_k = 16: I will see if I can create the sensing matrix by running it overnight tonight and through the day tomorrow (and possibly overnight again if necessary). I previously tried it for just one night; I think my machine should be able to finish given more time.

And I have begun working with KMC and setting up the code to form y in the WGS setting. I'll have a commit soon with that and possibly a couple questions to follow in #5.

cmcolbert commented 4 years ago

I ran the code to form the sensing matrix for k_size = 16 for three days, and it did not even reach the checkpoint of finishing kmer_counts_per_sequence. Either my machine cannot handle 16, or it just takes a very long time.

cmcolbert commented 4 years ago

Here is the L1 error graph for the thresholded (burnt orange) and non-thresholded (bright green) versions of MinDivLP at large_k = 14, small_k = 8, const = 10000, and q = 0.01.

[Plot: L1 error line plot, large_k = 14, small_k = 8]

Calculating the L1 error for other methods will take a little more work since they provide feature classifications rather than relative abundances.
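For reference, one way the L1 error could be computed once both results are expressed as relative abundances over taxa (a sketch with hypothetical dict inputs; aligning feature classifications to abundances is the extra work mentioned above):

```python
def l1_error(estimated, expected):
    # L1 error between two {taxon: relative abundance} mappings, summing the
    # absolute differences over the union of taxa.
    taxa = set(estimated) | set(expected)
    return sum(abs(estimated.get(t, 0.0) - expected.get(t, 0.0)) for t in taxa)
```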

cmcolbert commented 4 years ago

[Plots: unweighted and weighted UniFrac line plots]

Here are graphs of the weighted and unweighted UniFrac distances between the expected results and the results of the various methods and parameter settings on each mock community used. MinDivLP is in green and thresholded MinDivLP is in black; both use the parameters large_k = 14, small_k = 8, q = 0.01, and const = 10000. It is interesting to note that thresholded MinDivLP gives the lowest weighted UniFrac distance for mock 22 and fairly low unweighted UniFrac distances for a few mock communities.
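For anyone reproducing these distances outside the tax-credit framework, scikit-bio exposes per-pair UniFrac functions; a rough sketch with placeholder counts, OTU IDs, and tree path (the exact scikit-bio version/API may differ):

```python
from skbio import TreeNode
from skbio.diversity.beta import unweighted_unifrac, weighted_unifrac

# Placeholder inputs: counts per OTU for the expected and observed profiles,
# the shared OTU IDs, and a reference phylogeny (the file path is hypothetical).
otu_ids = ["OTU1", "OTU2", "OTU3"]
expected_counts = [10, 0, 5]
observed_counts = [8, 2, 5]
tree = TreeNode.read("gg_13_8_99_otus.tree")

u = unweighted_unifrac(expected_counts, observed_counts, otu_ids, tree)
w = weighted_unifrac(expected_counts, observed_counts, otu_ids, tree, normalized=True)
print(u, w)
```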