Closed dkoslicki closed 4 years ago
@cmcolbert To create the true `x`, after running `make_mock_metagenomes.sh`, run:

```
~/Documents/bbmap/./bbmap.sh in=mock_16S_metagenome.fa ref=mock_reference.fa out=/dev/null covstats=covstats.txt twocolumn=t
```

The first column of `covstats.txt` identifies the reference genome, and the second column is proportional to the true `x` (as per this comment).
I suggest creating a simple python script that:

1. runs the `bbmap.sh` command (in python, you can run command line scripts with something like `subprocess.run("<command to run>", shell=True, stdout=subprocess.DEVNULL)`)
2. parses `covstats.txt`
3. creates an `x` of the correct dimension (i.e. puts the abundances in the correct place based on the training database used by `MinDivLP.py`)

Put this script in `PythonCode/experiments/simulated_benchmark` and call it something like `make_true_x.py`.
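A minimal sketch of what such a script could look like. The helper names and the covstats layout (tab-separated reference-ID/coverage pairs with a `#`-prefixed header) are assumptions, not the actual implementation:

```python
# Hypothetical sketch of make_true_x.py; function names and the covstats
# layout are assumptions.
import subprocess
import numpy as np

def run_bbmap(bbmap_sh, metagenome_fa, reference_fa, covstats_out="covstats.txt"):
    """Map the mock metagenome against the reference and write covstats."""
    cmd = (f"{bbmap_sh} in={metagenome_fa} ref={reference_fa} "
           f"out=/dev/null covstats={covstats_out} twocolumn=t")
    subprocess.run(cmd, shell=True, check=True, stdout=subprocess.DEVNULL)

def parse_covstats(lines):
    """Parse twocolumn covstats lines into {reference_id: coverage}."""
    cov = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        ref_id, value = line.rstrip("\n").split("\t")[:2]
        cov[ref_id] = float(value)
    return cov

def make_true_x(cov, training_ids):
    """Place normalized coverages into a vector ordered like the training DB."""
    x = np.array([cov.get(ref_id, 0.0) for ref_id in training_ids])
    total = x.sum()
    return x / total if total > 0 else x
```

Here `training_ids` would come from whatever ordering `MinDivLP.py` uses for its training database, so the entries of `x` land in the correct positions.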
Here are plots of various parameters against L1 error in the case of a single-species sample with the full 10,000-species reference.
Thanks @cmcolbert! So apparently the support is being accurately reconstructed (given that there are no L1 errors of 2, which would indicate completely disjoint supports). It's not surprising that the L1 error decreases as a function of `small_k` (given that this predominantly controls the abundance estimation).
Most likely, the poor performance is due to the bbmap pipeline only providing a proxy for the true `x`. If we wanted to, we could move to an actual metagenomic read simulator (such as this, this, or this; that last one is old, but I have a script for constructing the true `x` from its outputs), but I don't think that's high priority at this point.
What should be high priority is a plot of % support reconstructed as a function of `large_k` and `small_k` combinations (since we can't trust bbmap's abundance estimation, but we should be able to trust its ability to recapitulate the correct support).
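One possible way to build that plot, sketched under the assumption that the results table carries a per-row "support recovered" column alongside the "large k"/"small k" columns used in the plotting snippets elsewhere in this thread (the "support recovered" column name is an assumption):

```python
# Sketch only: "support recovered" is an assumed column name.
import pandas as pd

def support_grid(df):
    """Mean proportion of support recovered per (large k, small k) pair."""
    return df.pivot_table(index="large k", columns="small k",
                          values="support recovered", aggfunc="mean")

def plot_support_grid(df, out="support_grid.png"):
    """Render the grid as a heatmap (plotting imports kept inside the function
    so the data-shaping part runs without seaborn installed)."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend; no display required
    import seaborn as sns
    ax = sns.heatmap(support_grid(df), annot=True, vmin=0, vmax=1)
    ax.figure.savefig(out)
```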
It's my pleasure, @dkoslicki! I am currently still running the same program for varying support sizes, and I can create those graphs once it is completed; it's almost 2/3 done at the moment.
For the same single-species simulation as in the previous graph, but with the mock metagenome being the single full sequence, each set of parameters reconstructed the true `x` perfectly.
That's good to know at least! Note, I am about to push some changes to allow `simulated_benchmark.py` to be run no matter where bbmap is installed (and a few other fixes). So you may need to adjust how you are calling it (i.e. you now need to provide the location of the bbmap directory via `simulated_benchmark.py -b <bbmap_directory>`). I've also changed `make_mock_metagenomes.sh`: I noticed that you added two CLI arguments for coverage and support size, so I added these in as well. Let me know if I stomped on anything you were using and we can address it.
Just for posterity's sake: it looks like there is quite a bit of sensitivity to `lambda` (i.e. `const`). Note:
```python
import pandas as pd
import seaborn as sns

data = pd.read_csv('big_results.csv')
subset = data[(data["small k"] == 6) & (data["large k"] == 10) & (data["coverage"] == 1000) & (data["q"] == .01)]
sns.lineplot(x="support size", y="l1 error", data=subset, style="const")
```
And it looks like MinDivLP is doing better than Quikr, via:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for the figure, labels, and title

fig = plt.figure()
data = pd.read_csv('big_results_with_quikr.csv')
subset = data[(data["coverage"] == 1000)]
ax1 = sns.lineplot(x="support size", y="l1 error", data=subset, color="b")
ax2 = sns.lineplot(x="support size", y="quikr error", data=subset, color="r")
fig.legend(labels=["MinDivLP", "Quikr"])
plt.ylabel("L1 error")
plt.title("q=0.01, const=1000, small_k=6\n large_k=10, coverage=1000")
```
As such, I posit the following:

1. The main source of error is model mismatch (i.e. `A*x_true` not matching `y`) in the form of non-uniform sampling of reads. I confirmed this by looking at how it's generating the reads, and they are biased towards the center (i.e. missing the ends of the 16S sequences).
2. This, in turn, drives the sensitivity to the `lambda` value.

So: let's call this benchmark performance fine for now. Especially since we want to concentrate on recovering the taxonomic profile, not the exact organisms in the sample.
Here are graphs of results from support sizes 2, 5, and 15 (25 is still running):
This last graph is encouraging since large `small_k` values recovered the whole support. To be clear, the proportion of support recovered is the size of the support common to `x_star` and `true_x`, divided by the size of the support of `true_x`. (Note: there are 18 points for each combination of `large_k` and `small_k`; many overlap.)
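That definition can be written down directly. A sketch, assuming `x_star` and `true_x` are nonnegative abundance vectors of the same length:

```python
import numpy as np

def support_recovered(x_star, true_x, tol=0.0):
    """|supp(x_star) ∩ supp(true_x)| / |supp(true_x)|."""
    est = set(np.flatnonzero(np.asarray(x_star) > tol))
    true = set(np.flatnonzero(np.asarray(true_x) > tol))
    if not true:
        return 1.0  # empty true support: trivially recovered
    return len(est & true) / len(true)
```

A nonzero `tol` lets you ignore numerically tiny abundances when deciding what counts as "in the support".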
Closing this as sufficiently tested for now (and can revisit with a different read simulator in the future if needed).
This will use bbmap for a simple benchmark of accuracy on simulated data, computing the true `x` from the simulated mock metagenome. The above is to be repeated:

- with `mock_reference.fa` (best case scenario: reference database contains only organisms in the simulated mock metagenome)
- with `97_otus_subset.fasta` as the training database (more realistic: reference database contains organisms not in the simulated metagenome)
- with `make_mock_metagenomes.sh` that uses `bbmap` to introduce errors, against `mock_reference.fa`
- with errors introduced and `97_otus_subset.fasta` as the training database