[Testing] Check benchmarking result for Dobrovolny Method and optimize

gregdenay commented 2 months ago

Benchmarking using the FooDMe dataset published in Denay et al. 2023, after correction for Macropodidae (expected at family level), Dama dama (not detected by the laboratory method), and two apparently switched samples (119 and 120):

Qualitatively 100% identical results between MIDORI lrna and NCBI core_nt
5 false negative results: two are known misses (ERR10436119 and 120), and 3 are also found in FooDMe1
10 false positive results; interestingly, only a few overlap with the 10 from FooDMe1, and they are always <1% of total

At the genus level with a 0.1% cutoff, we get a precision of 98.19% and a recall of 99.08%, which I think is pretty neat. I don't see much use in spending more time on optimization at this point; maybe some adjustments will come from the user side.
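
For reference, here is a minimal sketch of how precision and recall at a relative-abundance cutoff can be computed. This is not the actual benchmarking code: the function names, the genus names, and the abundance values are all hypothetical, and it assumes each sample is given as a mapping from genus to fraction of reads.

```python
def apply_cutoff(abundances, cutoff=0.001):
    """Keep taxa whose relative abundance is at or above the cutoff (0.1% = 0.001)."""
    return {taxon for taxon, frac in abundances.items() if frac >= cutoff}


def score_sample(expected, abundances, cutoff=0.001):
    """Count true positives, false positives and false negatives for one sample."""
    called = apply_cutoff(abundances, cutoff)
    tp = len(called & expected)   # expected and reported
    fp = len(called - expected)   # reported but not expected
    fn = len(expected - called)   # expected but not reported
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Return (precision, recall) as fractions from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical sample: values are made up for illustration only.
expected = {"Bos", "Sus", "Gallus"}
abundances = {"Bos": 0.60, "Sus": 0.3495, "Gallus": 0.0005, "Ovis": 0.05}

tp, fp, fn = score_sample(expected, abundances)
precision, recall = precision_recall(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} precision={precision:.2%} recall={recall:.2%}")
```

In this toy sample, "Gallus" falls below the cutoff and counts as a false negative, while "Ovis" is reported but not expected and counts as a false positive. For a whole dataset, the counts would be summed over all samples before computing the two ratios.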

Maybe we can add a small report on this to the doc later on.

marchoeppner commented 2 months ago

Very nice. The only thing I still have on my radar is that issue with Cutadapt letting through reads that are clipped on only one side - which I suppose could account for some of the low-frequency noise?
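
To make the "clipped on only one side" case concrete, here is a small sketch that classifies a read by whether the forward primer is found at its start and the reverse complement of the reverse primer at its end. It is only an illustration under simplifying assumptions (exact matches, no mismatches or indels) and does not reproduce how Cutadapt matches adapters; the primer sequences and the function names are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def primer_status(read, fwd, rev):
    """Return 'both', 'one' or 'none' depending on which primers flank the read.

    Assumes exact matching: the forward primer at the 5' end and the
    reverse complement of the reverse primer at the 3' end.
    """
    has_fwd = read.startswith(fwd)
    has_rev = read.endswith(revcomp(rev))
    if has_fwd and has_rev:
        return "both"
    if has_fwd or has_rev:
        return "one"
    return "none"


# Hypothetical primers and reads, for illustration only.
FWD = "ACGTAC"
REV = "AAGGCC"  # reverse complement is GGCCTT

reads = {
    "full_amplicon": "ACGTACTTTTGGCCTT",
    "one_sided": "ACGTACTTTTAAAA",
    "no_primer": "TTTTTTTTTT",
}

for name, seq in reads.items():
    print(name, primer_status(seq, FWD, REV))

# Keeping only reads flanked by both primers drops the one-sided ones.
kept = [name for name, seq in reads.items() if primer_status(seq, FWD, REV) == "both"]
```

Reads in the "one" category are the ones that would slip through if only one primer is required.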

Adding the benchmark metrics to the documentation is definitely a good idea.

gregdenay commented 2 months ago

Good point. Is this something we want fixed for v1.0? It looks like low effort, but I'm not sure the Report -> JSON -> MQC part is that easy.

I'll add a doc issue for validation data so we don't forget

marchoeppner commented 2 months ago

Well, we already have the module ready to go from before the change - so that is zero effort. But the JSON thing... no clue. I'll have a look next week to see if this is even feasible (not sure about the JSON contents).

bio-raum / FooDMe2

[Testing] Check benchmarking result for Dobrovolny Method and optimize #40