KoslickiLab / DiversityOptimization

Minimizing biological diversity to improve microbial taxonomic reconstruction
MIT License

Benchmark performance on simulated data #12

Closed dkoslicki closed 4 years ago

dkoslicki commented 4 years ago

This will use bbmap for a simple benchmark of accuracy on simulated data.

The above is to be repeated:

dkoslicki commented 4 years ago

@cmcolbert To create the true x, after running make_mock_metagenomes.sh , use a command like:

~/Documents/bbmap/bbmap.sh in=mock_16S_metagenome.fa ref=mock_reference.fa out=/dev/null covstats=covstats.txt twocolumn=t

The first column identifies the reference genome. The second column of covstats.txt is proportional to the true x (as per this comment).

I suggest creating a simple python script that:

  1. runs the above bbmap.sh command (in Python, you can run command-line scripts with something like subprocess.run("<command to run>", shell=True, stdout=subprocess.DEVNULL)),
  2. normalizes the second column of covstats.txt,
  3. and creates x of the correct dimension (i.e., puts the abundances in the correct places based on the training database used by MinDivLP.py).

Put this script in PythonCode/experiments/simulated_benchmark and call it something like make_true_x.py.
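A minimal sketch of what make_true_x.py could look like (the function names and the ref_ids argument are hypothetical; ref_ids stands in for however the MinDivLP.py training database orders its references):

```python
import subprocess
import numpy as np

def parse_covstats(path):
    # read bbmap's two-column covstats file (twocolumn=t): ID <tab> coverage
    coverages = {}
    with open(path) as f:
        next(f)  # skip the header line
        for line in f:
            ref_id, cov = line.rstrip("\n").split("\t")
            coverages[ref_id] = float(cov)
    return coverages

def make_true_x(bbmap_dir, metagenome_fa, reference_fa, ref_ids,
                covstats="covstats.txt"):
    # 1. run bbmap (the command from the comment above) to get per-reference coverage
    cmd = (f"{bbmap_dir}/bbmap.sh in={metagenome_fa} ref={reference_fa} "
           f"out=/dev/null covstats={covstats} twocolumn=t")
    subprocess.run(cmd, shell=True, stdout=subprocess.DEVNULL)
    coverages = parse_covstats(covstats)
    # 2./3. place each coverage at the index its reference occupies in the
    # training database (ref_ids ordering), then normalize so true_x sums to 1
    x = np.array([coverages.get(r, 0.0) for r in ref_ids])
    total = x.sum()
    return x / total if total > 0 else x
```

References absent from covstats.txt get abundance 0, so the returned vector always has the training database's dimension.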

cmcolbert commented 4 years ago

[Figure: parameters_vs_l1error] Here are plots of various parameters against L1 error for a single-species sample with the full 10,000-species reference.

dkoslicki commented 4 years ago

Thanks @cmcolbert! So apparently the support is being accurately reconstructed (since there are no L1 errors of 2). It's not surprising that the L1 error decreases as a function of small_k, given that this predominantly controls the abundance estimation.

Most likely, the poor performance is due to the bbmap pipeline only providing a proxy for the true_x. If we wanted to, we could move to an actual metagenomic read simulator (such as this, this, or this; that last one is old, but I have a script for constructing true_x from its outputs), but I don't think that's high priority at this point.

What should be high priority: a plot of % support reconstructed as a function of large_k and small_k combinations, since we can't trust bbmap's abundance estimation, but we should be able to trust its ability to recapitulate the correct support.

cmcolbert commented 4 years ago

It's my pleasure, @dkoslicki! I am still running the same program for varying support sizes and can create those graphs once it completes; it's almost 2/3 done at the moment.

cmcolbert commented 4 years ago

For the same single-species simulation as in the previous graph, but with the mock metagenome being the single full sequence, each set of parameters reconstructed true_x perfectly.

dkoslicki commented 4 years ago

That's good to know at least! Note: I am about to push some changes that allow simulated_benchmark.py to be run no matter where bbmap is installed (plus a few other fixes). You may need to adjust how you are calling it; you now need to provide the location of the bbmap directory via simulated_benchmark.py -b <bbmap_directory>. I've also changed make_mock_metagenomes.sh: I noticed you added two CLI arguments for coverage and support size, so I added those in as well. Let me know if I stomped on anything you were using and we can address it.
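For reference, the new flag could be handled with argparse along these lines (a guess at the option handling; the actual code in simulated_benchmark.py may differ):

```python
import argparse

# hypothetical sketch of the -b/--bbmap_dir option described above
parser = argparse.ArgumentParser(description="Benchmark MinDivLP on simulated data")
parser.add_argument("-b", "--bbmap_dir", required=True,
                    help="path to the directory containing bbmap.sh")
args = parser.parse_args(["-b", "/path/to/bbmap"])  # e.g. from the command line
```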

dkoslicki commented 4 years ago

Just for posterity's sake: it looks like there is quite a sensitivity to lambda (i.e., const). Note:

import pandas as pd
import seaborn as sns
data = pd.read_csv('big_results.csv')
subset = data[(data["small k"] == 6) & (data["large k"] == 10) & (data["coverage"] == 1000) & (data["q"] == .01)]
sns.lineplot(x="support size", y="l1 error", data=subset, style="const")

[Figure_1]

dkoslicki commented 4 years ago

And it looks like MinDivLP is doing better than Quikr ([Figure_2]), generated via:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

fig = plt.figure()
data = pd.read_csv('big_results_with_quikr.csv')
subset = data[(data["coverage"] == 1000)]
ax1 = sns.lineplot(x="support size", y="l1 error", data=subset, color="b")
ax2 = sns.lineplot(x="support size", y="quikr error", data=subset, color="r")
fig.legend(labels=["MinDivLP", "Quikr"])
plt.ylabel("L1 error")
plt.title("q=0.01, const=1000, small_k=6\n large_k=10, coverage=1000")

dkoslicki commented 4 years ago

As such, I posit the following:

  1. bbmap is introducing a bunch of noise (as confirmed by A*x_true not matching y) in the form of non-uniform sampling of reads. I confirmed this by looking at how it generates the reads; they are biased towards the center (i.e., missing the ends of the 16S sequences).
  2. as such, performance is sensitive to the lambda value
  3. MinDivLP is still doing much better than Quikr
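The mismatch in point 1 can be quantified with a quick residual check (a sketch, assuming A, x_true, and y are already available as NumPy arrays of compatible shape):

```python
import numpy as np

def relative_l1_residual(A, x_true, y):
    # L1 distance between the model prediction A @ x_true and the observed
    # measurement y, scaled by ||y||_1; near zero means the linear model
    # explains the data, while large values indicate sampling noise
    return np.abs(A @ x_true - y).sum() / np.abs(y).sum()
```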

So: let's call this benchmark performance fine for now, especially since we want to concentrate on recovering the taxonomic profile, not the exact organisms in the sample.

cmcolbert commented 4 years ago

Here are graphs of results from support sizes 2, 5, and 15 (25 is still running):

[Figures: parameters_vs_l1oerror_s=2, parameters_vs_l1oerror_s=5, parameters_vs_l1oerror_s=15, k_values_vs_support_recovered]

This last graph is encouraging, since large small_k values recovered the whole support. To be clear, the proportion of support recovered is the size of the support common to x_star and true_x, divided by the size of the support of true_x. (Note: there are 18 points for each combination of large_k and small_k; many overlap.)
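The metric described above can be computed as follows (a sketch; the tol threshold for treating an entry as nonzero is my assumption, not something fixed by the thread):

```python
import numpy as np

def proportion_support_recovered(x_star, true_x, tol=1e-8):
    # |supp(x_star) ∩ supp(true_x)| / |supp(true_x)|, with entries whose
    # magnitude is below tol treated as zero
    true_supp = np.flatnonzero(np.abs(true_x) > tol)
    est_supp = np.flatnonzero(np.abs(x_star) > tol)
    common = np.intersect1d(true_supp, est_supp).size
    return common / true_supp.size
```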

dkoslicki commented 4 years ago

Closing this as sufficiently tested for now (and can revisit with a different read simulator in the future if needed).