OpenTreeOfLife / feedback

No code -- just an issue tracker for general feedback (sent here via GitHub's issues API)
1 stars 0 forks source link

tnrs approximate matching for multiple taxa not returning any matches #528

Closed LunaSare closed 3 months ago

LunaSare commented 2 years ago

While working with the R wrapper for OpenTree's APIs package rotl, I was getting different results when running the tnrs_match_names with a single taxon name and a multiple taxon names. When running it with a single taxa, it works as expected

> rotl::tnrs_match_names(names = "amphibians", do_approximate_matching = TRUE)

  search_string unique_name approximate_match ott_id is_synonym
1    amphibians    Amphibia              TRUE 544595      FALSE
  flags number_matches
1                    6

But when running it with a character vector of multiple taxa, it seems to ignore approximate matching:

> rotl::tnrs_match_names(names = c("amphibians", "canis"), do_approximate_matching = TRUE)

  search_string unique_name approximate_match ott_id is_synonym
1    amphibians        <NA>                NA     NA         NA
2         canis       Canis             FALSE 372706      FALSE
  flags number_matches
1  <NA>             NA
2                    2

Warning message:
amphibians are not matched 

I reported the issue in the R package GitHub repo https://github.com/ropensci/rotl/issues/134, and they looked into it and realized it comes from the OpenTree API. The following was originally posted by @fmichonneau in https://github.com/ropensci/rotl/issues/134#issuecomment-918948953:

$   curl -X POST https://api.opentreeoflife.org/v3/tnrs/match_names \
-H "content-type:application/json" -d \
'{"names":["Amphibians","Canis"], "do_approximate_matching":true}'
Output ```json { "context": "Mammals", "governing_code": "ICZN", "includes_approximate_matches": true, "includes_deprecated_taxa": false, "includes_suppressed_names": false, "matched_names": [ "Canis" ], "results": [ { "matches": [], "name": "Amphibians" }, { "matches": [ { "is_approximate_match": false, "is_synonym": false, "matched_name": "Canis", "nomenclature_code": "ICZN", "score": 1.0, "search_string": "canis", "taxon": { "flags": [], "is_suppressed": false, "is_suppressed_from_synth": false, "name": "Canis", "ott_id": 372706, "rank": "genus", "source": "ott3.3draft1", "synonyms": [ "Aenocyon", "Alopedon", "Alopsis", "Canix", "Chaon", "Dasycyon", "Dieba", "Dimenia", "Jacalius", "Lupulella", "Lupulus", "Lupus", "Lyciscus", "Mamcanisus", "Neocyon", "Oreocyon", "Oxygonus", "Oxygous", "Sacalius", "Schaeffia", "Simenia", "Thos", "Vulpicanis" ], "tax_sources": [ "ncbi:9611", "gbif:5219142", "irmng:1282727" ], "unique_name": "Canis" } }, { "is_approximate_match": false, "is_synonym": true, "matched_name": "Canis", "nomenclature_code": "ICZN", "score": 1.0, "search_string": "canis", "taxon": { "flags": [], "is_suppressed": false, "is_suppressed_from_synth": false, "name": "Atelocynus", "ott_id": 621180, "rank": "genus", "source": "ott3.3draft1", "synonyms": [ "Canis" ], "tax_sources": [ "ncbi:68721", "gbif:2434453", "irmng:1025378" ], "unique_name": "Atelocynus" } } ], "name": "Canis" } ], "taxonomy": { "author": "open tree of life project", "name": "ott", "source": "ott3.3draft1", "version": "3.3", "weburl": "https://tree.opentreeoflife.org/about/taxonomy-version/ott3.3" }, "unambiguous_names": [ "Canis" ], "unmatched_names": [ "Amphibians" ] } ```
mtholder commented 2 years ago

Thanks for reporting this. If Cannis is the second name, it works.

We have a shortcut to avoid expensive fuzzy searches if there is an exact match. It looks like we have the logic wrong when there are multiple names, because we should be passing all the non-exact names through to the fuzzy match when some match exactly (instead of just returning the exact match). I'm busy this week, perhaps @bredelings will have time to take a look.

[edited to correct grammar]

bredelings commented 2 years ago

I wonder if this was original behavior that was copied from the java implementation. Regardless, it does seem like it should be fixed / changed.

bredelings commented 2 years ago

Hmm.... what is happening here is that we are inferring the context "Mammals" if we have "canis" in the search list, but otherwise we infer "LIFE". We then fail to find "amphibians" inside "Mammals".

bredelings commented 2 years ago

@LunaSare @mtholder I think this should work if you specify that the context is Life.

Do you think we should change the default behavior, so that we always assume that the context is Life, when the context is unspecified?

We could, for example, only guess the context from the exact matches when there are at least three of them. But I think that magically changing the behavior when the number of exact matches changes could be very confusing for the users.

fmichonneau commented 2 years ago

Using "Life" across all terms seems the more expected behavior.

LunaSare commented 2 years ago

I agree! using “Life” as default when no context is specified is my expectation. Having different rules to guess a context might get too complicated in my opinion and would require some thorough additional documentation for users.

mtholder commented 2 years ago

I just checked the docs ( https://github.com/OpenTreeOfLife/germinator/wiki/TNRS-API-v3#match_names ) and the service is performing the documented behavior. Though the fact that Ben and I both thought this was a bug is a good indication that we agree with you that this is a very confusing default behavior.

A rotl workaround would be for rotl to add the context "Life" if the use does not specify a context. That would turn off the confusing "inferred contexts" part of this service. I don't mind changing the default behavior in v4, but I'm loathe to change the behavior on v3 given that it is documented and some folks may be relying on that behavior. In this case, I'm having a hard time imagining anyone relying on this weirdness. So, I'm persuadable.

mtholder commented 2 years ago

just to be clear: I don't mind changing the default behavior in v4 to using "Life" as the default context.

LunaSare commented 2 years ago

Ah! I see what you mean @mtholder! I think then that just specifying that the context is "All life" for multiple taxon searches, when no other context is known should give the desired behavior for v3, for now:

my_taxa <- c("amphibians", "canis", "felis", "delphinidae", "avess")

rotl::tnrs_match_names(names = my_taxa, context_name = "All life")

> resolved_names
  search_string unique_name approximate_match ott_id is_synonym
1    amphibians    Amphibia              TRUE 544595      FALSE
2         canis       Canis             FALSE 372706      FALSE
3         felis       Felis             FALSE 563165      FALSE
4   delphinidae Delphinidae             FALSE 698406      FALSE
5         avess        Aves              TRUE  81461      FALSE
  flags number_matches
1                    6
2                    2
3                    1
4                    1
5                    1
fmichonneau commented 2 years ago

Good point @mtholder! In the GitHub version of rotl, I changed the default value of the context for the tnrs_match_names function to use "All life". @LunaSare if you have a chance to test it, let me know how this works for you.

LunaSare commented 2 years ago

Sure @fmichonneau! I tried it after installing rotl from github and it works good!

> devtools::install_github("ropensci/rotl")
Output ```r Downloading GitHub repo ropensci/rotl@HEAD ✓ checking for file ‘/private/var/folders/ss/2tpkp325521_kfgn59g44vd80000gn/T/RtmpxL6RuJ/remotes94454493cd06/ropensci-rotl-13b568a/DESCRIPTION’ ... ─ preparing ‘rotl’: ✓ checking DESCRIPTION meta-information ... ─ checking for LF line-endings in source and make files and shell scripts ─ checking for empty or unneeded directories Omitted ‘LazyData’ from DESCRIPTION ─ building ‘rotl_3.0.11.900.tar.gz’ * installing *source* package ‘rotl’ ... ** using staged installation ** R ** inst ** byte-compile and prepare package for lazy loading ** help *** installing help indices ** building package indices ** installing vignettes ** testing if installed package can be loaded from temporary location ** testing if installed package can be loaded from final location ** testing if installed package keeps a record of temporary installation path * DONE (rotl) ```

> my_taxa <- c("amphibians", "canis", "felis", "delphinidae", "avess")
> resolved_names <- rotl::tnrs_match_names(names = my_taxa
+ )
> resolved_names
  search_string unique_name approximate_match ott_id is_synonym flags
1    amphibians    Amphibia              TRUE 544595      FALSE      
2         canis       Canis             FALSE 372706      FALSE      
3         felis       Felis             FALSE 563165      FALSE      
4   delphinidae Delphinidae             FALSE 698406      FALSE      
5         avess        Aves              TRUE  81461      FALSE      
  number_matches
1              6
2              2
3              1
4              1
5              1
> 
fmichonneau commented 2 years ago

FYI, this fix is now available in rotl 3.0.12 which was just published on CRAN.