epruesse / SINA

SINA - Reference based multiple sequence alignment
https://sina.readthedocs.io
GNU General Public License v3.0
40 stars 4 forks source link

Problems with LCA quorum #93

Open jgerken opened 3 years ago

jgerken commented 3 years ago

Hi Elmar,

we stumbled across a weird problem when classifying sequences with SINA. One of our users searched for a partial sequences which as classified as Archaea;Halobacterota; despite eight of the ten nearest neighbours having a classification of Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus;.

I created a minimal working example with a SILVA dataset reduced to the ten nearest neighbours. When you execute

./sina -i aligned.query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.95 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:26:42 [log] lca_tax_slv: Archaea;Halobacterota;

I then searched for a very similar full-length sequence in SILVA an ran the following query:

./sina -i aligned.full-length-query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.5 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:26:42 [log] lca_tax_slv: Archaea;Halobacterota;

But when I change the the --lcq-quorum to 0.79 I get the expected result:

./sina -i aligned.full-length-query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.5 --lca-quorum 0.79 --search-no-fast --lca-fields tax_slv -v -v -v
...
21:30:17 [log] lca_tax_slv: Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus;

Between the 2nd and the 3rd example, there is a floating point precision problem. I think the ratio that is compared with the value of lca_quorum needs to be rounded according to the precision of the floating point type used.

The difference between the first and the other two examples is that the first run returns one of the two sequences with a deviating classification of Archaea;Halobacterota;Methanosarcinia;Methanosarciniales;Methanosarcinaceae;Methanosarcina; as most similar neighbour whereas in the other two runs a sequence of the majority is selected. Thus, it seems that the LCA quorum always tries to use the taxonomy of the most similar neighbour instead of trying find the most common classification that fulfils the quorum.

Could you have a look?

Thanks Jan

minimal-example.tar.gz

epruesse commented 3 years ago

I'll have a look at it. Note on the side, you can write -vvv instead of -v -v -v. I assume you were using the 1.7.0?

jgerken commented 3 years ago

Thanks :) Yes, I used 1.7.0 but also tried 1.6.1 and 1.2.11 and the classification result was the same with all versions (did not check the nearest neighbors in details with the old versions, though).

epruesse commented 3 years ago

mind if I add the example to the unit tests?

epruesse commented 3 years ago

Simple enough. Classic int/float truncation issue. Should work now. I'm pushing 1.7.1, should be in bioconda within a day or so.

jgerken commented 3 years ago

mind if I add the example to the unit tests?

No, go ahead.

The submitted fix only fixes the difference between the second and third execution. The problem that the first test case

./sina -i aligned.query.fasta --prealigned --fs-kmer-no-fast --search --search-db SILVA-no-fields-reduced.opt.arb --search-kmer-len 10 --search-min-sim 0.95 --lca-quorum 0.8 --search-no-fast --lca-fields tax_slv -vvv

returns Archaea;Halobacterota; instead of Archaea;Halobacterota;Methanomicrobia;Methanomicrobiales;Methanomicrobiaceae;Methanoculleus; despite 8 of the 10 top hits having the latter classification is still present in 1.7.1.

epruesse commented 3 years ago

Darn - didn't see that part. Case of TL'DR I suppose.

epruesse commented 3 years ago

You are correct in your observation. The algorithm does an LCA between the best match and the others. Guess that was usually so close to what it's supposed to do that no-one notices. This will take me a day or two to fix. Don't have enough time in the evenings.