KoslickiLab / DiversityOptimization

Minimizing biological diversity to improve microbial taxonomic reconstruction
MIT License

Benchmark 16S vs QIIME #9

Open dkoslicki opened 4 years ago

dkoslicki commented 4 years ago

Goal: compare the performance (in terms of accuracy, speed, and RAM usage) of the MinDivLP approach, and a standard QIIME2 workflow.

If you have problems getting QIIME2 to work, Galaxy may be your friend, as it already has dada2 and mothur installed (so you wouldn't need to worry about tool installation issues) and some pre-built workflows already exist.

If time constraints become a problem, focus primarily on the Naive Bayes classifier in QIIME2 (as indicated by this publication).

cmcolbert commented 4 years ago

I have been working on getting the code from tax-credit up and running on my QIIME Virtual Machine, and I have encountered a few problems that I am working on fixing.

Here are the results from a small early test:

[Plots: mock community Taxon Accuracy Rate and Taxon Detection Rate line plots, k = 4]

cmcolbert commented 4 years ago

After resolving the memory problems by creating a Python implementation of nonnegative least squares that accepts sparse matrices, sparse_nnls, based on the Julia implementation, and resolving the divide-by-zero problem in the creation of f in MinDivLP by uniformly adding a small number epsilon to np.power(B.T @ y_large, 1 - q) (as seen in this commit), here are the results from using small_k = 8 and small_k = 10 with large_k = 12, q = 0.01, and lambda = 10000, with MinDivLP in bright green:

[Plots: Taxon Accuracy Rate and Taxon Detection Rate line plots]

This was done with mock communities 1, 2, 12, 13, 14, 15, 16, 18, 19, 20, 21, 22, and 23 from mockrobiota, using the framework from tax-credit. These are all of the communities in tax-credit that use gg_13_8_otus and have only one sample.
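For reference, here is a minimal sketch of the two fixes described above. The names B, y_large, q, and epsilon follow the thread, but the exact form of f and the use of scipy's lsq_linear are assumptions for illustration; the repo's sparse_nnls may differ.

```python
import numpy as np
from scipy.optimize import lsq_linear


def mindivlp_weights(B, y_large, q=0.01, epsilon=1e-8):
    # One plausible form of the weight vector f: the thread describes dividing
    # by np.power(B.T @ y_large, 1 - q), so epsilon is added uniformly to keep
    # zero entries from causing a divide-by-zero.
    return 1.0 / (np.power(B.T @ y_large, 1 - q) + epsilon)


def sparse_nnls_sketch(A, y):
    # Stand-in for a sparse-aware nonnegative least-squares solve:
    # lsq_linear accepts scipy.sparse matrices and nonnegativity bounds,
    # unlike scipy.optimize.nnls, which requires a dense matrix.
    return lsq_linear(A, y, bounds=(0, np.inf)).x
```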

Here is how the tax-credit paper defines Taxon Accuracy Rate (TAR) and Taxon Detection Rate (TDR); a small illustrative sketch of the computation follows the definitions:

At a given taxonomic level, a classification is a

  • True positive (TP), if that taxon is both observed and expected.
  • False positive (FP), if that taxon is observed but not expected.
  • False negative (FN), if a taxon is expected but not observed.

These are used to calculate TAR and TDR as

  • TAR = TP/(TP + FP) or the fraction of observed taxa that were expected at level L.
  • TDR = TP/(TP + FN) or the fraction of expected taxa that are observed at level L.
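A minimal sketch of that computation on sets of taxa at a fixed level (the function and argument names are illustrative, not tax-credit's own code):

```python
def tar_tdr(observed, expected):
    """Taxon Accuracy Rate and Taxon Detection Rate at one taxonomic level.

    observed, expected: collections of taxon labels at the same level.
    """
    observed, expected = set(observed), set(expected)
    tp = len(observed & expected)   # observed and expected
    fp = len(observed - expected)   # observed but not expected
    fn = len(expected - observed)   # expected but not observed
    tar = tp / (tp + fp) if (tp + fp) else 0.0
    tdr = tp / (tp + fn) if (tp + fn) else 0.0
    return tar, tdr
```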

The low taxon accuracy rate is possibly an artifact of including all nonzero values in the taxonomic profiles generated by MinDivLP without a threshold, which can cause many false positives. This can be checked once the L1 error function is in place in the framework: a low L1 error would mean the spurious taxa carry little abundance.

I ran the code to create the sensing matrix for the 99% OTUs at k = 14 overnight, so I will rerun the same tests with large_k = 14.

dkoslicki commented 4 years ago

The low taxon accuracy rate is possibly an artifact of including all nonzero values in the taxonomic profiles generated by MinDivLP without a threshold, which can cause many false positives. This can be checked once the L1 error function is in place in the framework: a low L1 error would mean the spurious taxa carry little abundance.

@cmcolbert Yes, it's actually typical in metagenomics studies to use a threshold: set some small value (say threshold = 0.001) and set any values below this to 0. Some people re-normalize after this, some don't. But you might consider also including the "thresholded" version in your analysis and plots.
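A minimal sketch of that thresholding step, assuming x is a relative-abundance vector (the threshold value and the renormalize option are illustrative choices, not a fixed convention):

```python
import numpy as np


def threshold_profile(x, threshold=0.001, renormalize=True):
    # Zero out relative abundances below the threshold; optionally renormalize
    # so the remaining abundances sum to 1 (some studies skip this step).
    x = np.asarray(x, dtype=float).copy()
    x[x < threshold] = 0.0
    total = x.sum()
    if renormalize and total > 0:
        x /= total
    return x
```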

Also, do you have any information about speed + RAM for our approach and the others?

cmcolbert commented 4 years ago

@dkoslicki I will make that change. As for speed and RAM: I am currently computing runtimes in the tax-credit runtime section, and the RAM used with 99_otus.fasta for up to small_k = 12 and large_k = 14 has always been less than 32 gigabytes, though it has varied throughout the computation. I will take better note of memory usage in the run after computing the runtimes.

Also, I ran some small tests to determine a good lambda value (only with mock community 1, large_k = 14, small_k = 8, and q = 0.01), and the taxon detection rate is not greatly affected by lambda in the case tested. However, the taxon accuracy rate is much higher for lambda = 1000, the smallest value tested and the only value that shows a decrease in taxon detection rate. This makes sense since the smaller lambda value induces a smaller support size, so I will run this test again with a threshold to see if the difference fades.

Here are the results from the no-threshold test for determining a good lambda value:

[Plots: TAR and TDR for the lambda sweep] (note: the legend format is mindivlp lambda large_k q small_k)

cmcolbert commented 4 years ago

Here are graphs comparing the results of MinDivLP with and without a threshold of 0.001:

[Plots: Taxon Accuracy Rate and Taxon Detection Rate, thresholded vs. non-thresholded]

Without a threshold is in bright green, and with a threshold is in burnt orange. These are both with small_k = 8, large_k = 12, q = 0.01, and lambda = 10000.

cmcolbert commented 4 years ago

Here are the results from the runtime analysis from tax-credit:

The runtime for MinDivLP with both the reference database and the query database at size 10000 is not included because it was taking far too long. Also, the test with varying query database sizes violates MinDivLP's assumption of low biological diversity, so it makes sense that it performed poorly there.

dkoslicki commented 4 years ago

That's quite unfortunate: slower, and less accurate. Maybe things will improve when we apply the approach to WGS data a la #5, #6, #8, and #10? @cmcolbert let's set up a meeting with us and Simon to discuss these results. The only other way forward I see for 16S is using alternate metrics (e.g. UniFrac) in case we are not predicting the exact organisms, but ones that are "close" taxonomically.

cmcolbert commented 4 years ago

@dkoslicki That sounds good! I will send an email.

Also, do you think a larger large_k value could make up the difference in accuracy? It would take a decent amount of time to create the matrices, but their sizes aren't out of hand. For 99_otus.fasta, it appears that the k=14 sensing matrix takes up about 3GB of memory (from checking memory usage for a Python console with only the matrix loaded). Although I suppose it would take longer to form y_large.

cmcolbert commented 4 years ago

Here are plots comparing the results of MinDivLP using large_k = 12 (blue) vs large_k = 14 (red):

[Plots: Taxon Accuracy Rate and Taxon Detection Rate]

(Using const = 10000, q = 0.01, small_k = 8)

dkoslicki commented 4 years ago

@cmcolbert that's unfortunate... Care to compare the thresholded version of large_k=14 and the non-thresholded versions against the other methods to see how far off we are?

Lastly, any hope of trying (over a night or two) large_k = 16? Given that the detection rate didn't drop but the accuracy went up, this does indicate that we aren't using too large of a large_k yet. It just depends on whether your machine can handle that size of large_k.

I have a sneaking suspicion that we may need to wrap this up and try WGS instead (which should actually be more straightforward after some initial learning-curve adjustments to the different data types and k-mer representations, given the work you've done already on the underlying algorithms).

cmcolbert commented 4 years ago

@dkoslicki absolutely! Here are the thresholded (burnt orange) and non-thresholded (bright green) versions of large_k = 14 against the other methods:

[Plots: Taxon Accuracy Rate and Taxon Detection Rate] (again with small_k = 8, q = 0.01, and const = 10000)

Re large_k = 16: I will see if I can create the sensing matrix by running it overnight tonight and through the day tomorrow (and possibly overnight again if necessary). I previously tried it for just one night; I think my machine should be able to finish given more time.

And I have begun working with KMC and setting up the code to form y in the WGS setting. I'll have a commit soon with that and possibly a couple questions to follow in #5.

cmcolbert commented 4 years ago

I ran the code to form the sensing matrix for k_size = 16 for three days, and it did not even reach the checkpoint of finishing kmer_counts_per_sequence. Either my machine cannot handle 16, or it just takes a very long time.

cmcolbert commented 4 years ago

Here is the L1 error graph for the thresholded (burnt orange) and non-thresholded (bright green) versions of MinDivLP at large_k = 14, small_k = 8, const = 10000, and q = 0.01.

[Plot: L1 error line plot, large_k = 14, small_k = 8]

Calculating the L1 error for other methods will take a little more work since they provide feature classifications rather than relative abundances.
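For reference, one way the L1 error could be computed once both results are expressed as relative abundances over taxa (a sketch with hypothetical dict inputs; aligning feature classifications to abundances is the extra work mentioned above):

```python
def l1_error(estimated, expected):
    # L1 error between two {taxon: relative abundance} mappings, summing the
    # absolute differences over the union of taxa.
    taxa = set(estimated) | set(expected)
    return sum(abs(estimated.get(t, 0.0) - expected.get(t, 0.0)) for t in taxa)
```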

cmcolbert commented 4 years ago

[Plots: unweighted and weighted UniFrac line plots]

Here are graphs of the weighted and unweighted UniFrac distances between the expected results and the results of the various methods and parameter settings on each mock community used. MinDivLP is in green and thresholded MinDivLP is in black; both use the parameters large_k = 14, small_k = 8, q = 0.01, and const = 10000. It is interesting to note that thresholded MinDivLP gives the lowest weighted UniFrac distance for mock 22 and fairly low unweighted UniFrac distances for a few mock communities.
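For anyone reproducing these distances outside the tax-credit framework, scikit-bio exposes per-pair UniFrac functions; a rough sketch with placeholder counts, OTU IDs, and tree path (the exact scikit-bio version/API may differ):

```python
from skbio import TreeNode
from skbio.diversity.beta import unweighted_unifrac, weighted_unifrac

# Placeholder inputs: counts per OTU for the expected and observed profiles,
# the shared OTU IDs, and a reference phylogeny (the file path is hypothetical).
otu_ids = ["OTU1", "OTU2", "OTU3"]
expected_counts = [10, 0, 5]
observed_counts = [8, 2, 5]
tree = TreeNode.read("gg_13_8_99_otus.tree")

u = unweighted_unifrac(expected_counts, observed_counts, otu_ids, tree)
w = weighted_unifrac(expected_counts, observed_counts, otu_ids, tree, normalized=True)
print(u, w)
```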