emepyc / Blast2lca

Calculates the lowest common ancestors of each query sequence in a Blast result
GNU General Public License v2.0

Dynamic programming matrix was not filled properly #5

Open milot-mirdita opened 6 years ago

milot-mirdita commented 6 years ago

During evaluation of the tool I found that the DP matrix for the RMQ (range minimum query) was not filled properly, so many queries were assigned the root node as their LCA. For example, mouse + human resulted in the root node.
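For readers unfamiliar with the technique, here is a minimal sketch of how an LCA lookup can be built on an RMQ over an Euler tour, using a sparse table as the DP matrix. It is not the Blast2lca or MMseqs2 code: the function names (`euler_tour`, `build_sparse_table`, `lca`) and the toy tree are assumptions made for illustration only.

```python
from typing import Dict, Hashable, List, Tuple

Node = Hashable


def euler_tour(children: Dict[Node, List[Node]], root: Node
               ) -> Tuple[List[Node], List[int], Dict[Node, int]]:
    """Walk the tree, recording each node on entry and after each child."""
    euler: List[Node] = []
    depth: List[int] = []
    first: Dict[Node, int] = {}

    def visit(node: Node, d: int) -> None:
        first[node] = len(euler)          # first position of node in the tour
        euler.append(node)
        depth.append(d)
        for child in children.get(node, []):
            visit(child, d + 1)
            euler.append(node)            # come back to the parent
            depth.append(d)

    visit(root, 0)
    return euler, depth, first


def build_sparse_table(depth: List[int]) -> List[List[int]]:
    """table[j][i] = index of the minimum depth in depth[i : i + 2**j]."""
    n = len(depth)
    table = [list(range(n))]              # level 0: windows of length 1
    j = 1
    while (1 << j) <= n:
        prev, half = table[j - 1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            # Combine the two half-windows computed on the previous level.
            left, right = prev[i], prev[i + half]
            row.append(left if depth[left] <= depth[right] else right)
        table.append(row)
        j += 1
    return table


def lca(a: Node, b: Node, euler: List[Node], depth: List[int],
        first: Dict[Node, int], table: List[List[int]]) -> Node:
    """The shallowest node between the first occurrences of a and b."""
    lo, hi = sorted((first[a], first[b]))
    j = (hi - lo + 1).bit_length() - 1    # largest power of two that fits
    left, right = table[j][lo], table[j][hi - (1 << j) + 1]
    return euler[left if depth[left] <= depth[right] else right]


# Toy tree (hypothetical, heavily simplified taxonomy).
children = {
    "root": ["Mammalia", "Bacteria"],
    "Mammalia": ["human", "mouse"],
    "Bacteria": ["E. coli"],
}
euler, depth, first = euler_tour(children, "root")
table = build_sparse_table(depth)
print(lca("human", "mouse", euler, depth, first, table))
print(lca("human", "E. coli", euler, depth, first, table))
```

In this sketch every level of the table is derived from the level below it. If the higher levels were filled incorrectly, queries over longer ranges could return the wrong node, which would be consistent with the symptom described above, although the sketch does not reproduce the actual bug.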

Thank you for the great implementation in all other regards. I have ported the code to C++ and integrated it into our homology search, clustering and metagenomics suite MMseqs2.

emepyc commented 6 years ago

Thanks for the PR! You mention that you have ported the code. Are you still relying on GI numbers? If not, what approach are you following for acc => taxid conversion?

milot-mirdita commented 6 years ago

We work only with UniProt, which is sufficiently well annotated with NCBI taxa.

MMseqs2 annotates the queries with UniProt accessions, which are then mapped to NCBI taxon IDs that the LCA tool can read.
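As a rough illustration of that chain (hit accessions, then taxon IDs, then LCA), here is a small sketch. It does not use any MMseqs2 API: `ACC_TO_TAXID`, `PARENT`, `lineage` and `lca_of_hits` are hypothetical names, and the parent map is a toy that collapses the many intermediate ranks of the real NCBI taxonomy.

```python
from typing import Dict, List, Optional

# Hypothetical lookup tables for illustration.
# P69905 / P01942 are the human / mouse hemoglobin alpha entries in UniProt.
ACC_TO_TAXID: Dict[str, int] = {"P69905": 9606, "P01942": 10090}

# Toy parent map: 9606 = Homo sapiens, 10090 = Mus musculus,
# 314146 = Euarchontoglires, 1 = root. Intermediate ranks are omitted.
PARENT: Dict[int, int] = {9606: 314146, 10090: 314146, 314146: 1, 1: 1}


def lineage(taxid: int) -> List[int]:
    """Path from a taxon up to the root (the root is its own parent)."""
    path = [taxid]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def lca_of_hits(accessions: List[str]) -> Optional[int]:
    """Map accessions to taxon IDs and intersect their lineages."""
    taxids = [ACC_TO_TAXID[a] for a in accessions if a in ACC_TO_TAXID]
    if not taxids:
        return None                       # no hit could be mapped
    common = lineage(taxids[0])           # ordered from taxon to root
    for t in taxids[1:]:
        seen = set(lineage(t))
        common = [x for x in common if x in seen]
    return common[0]                      # deepest shared ancestor


print(lca_of_hits(["P69905", "P01942"]))
```

Walking parent pointers like this is the naive approach; it is only meant to show the data flow, not the RMQ-based lookup discussed in this issue.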

We also implement a 2bLCA-like approach to get more reliable LCAs.

emepyc commented 6 years ago

Ah, ok. Thanks

milot-mirdita commented 6 years ago

By the way, do you have any manuscript that we could cite?

emepyc commented 6 years ago

No, not really. I never tried to publish this tool. But thanks for asking :-)