CatalogueOfLife / general

The Catalogue of Life

What exactly does /name/matching do? #79

Closed sckott closed 3 years ago

sckott commented 3 years ago

It's not clear to me what the /name/matching route does exactly.

mdoering commented 3 years ago

Yes, it is the matching against the names index, a rather internal thing that builds up a unique index of all distinct names found across all datasets in ChecklistBank. Every name in every dataset is automatically matched against this index, and the match is stored in nameIndexId and nameIndexMatchType, e.g.: http://api.catalogue.life/dataset/3/name?limit=2
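
For illustration, a minimal Python sketch of reading those two fields from that endpoint; the paged response shape (a `result` array) and the `scientificName` field are assumptions, not something confirmed in this thread:

```python
# Minimal sketch, assuming the name listing returns a paged JSON object
# with a "result" array of name records carrying the names-index match fields.
import requests

resp = requests.get(
    "http://api.catalogue.life/dataset/3/name",
    params={"limit": 2},
    timeout=30,
)
resp.raise_for_status()

for name in resp.json().get("result", []):
    # nameIndexId points at the names-index entry; nameIndexMatchType says how it matched.
    print(
        name.get("scientificName"),  # assumed field name for the name string
        name.get("nameIndexId"),
        name.get("nameIndexMatchType"),
    )
```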

This allows us to list all usages of the same name across datasets. It is also the basis for detecting whether a name is already present in COL or should be added, similar to how we build the GBIF Backbone.

I guess the next question then is: what is considered the same name? This is a difficult question and I expect the answer to still change. There is a discussion here: https://github.com/CatalogueOfLife/general/issues/35

Currently the implementation matches by the canonical name, its authorship and rank. Authorship matching is rather loose, but name matching is pretty strict and only allows for a few common misspellings frequently found in epithets (silent h, gender suffix, double letters, i/y), but not in uninomials. Suprageneric ranks are all considered to be the same; otherwise a different rank results in a different match. Initially we also differentiated between nomenclatural codes, but as this information is not always present it resulted in too many distinct names.

If no match is found, a new entry is inserted, provided the request was based on a name present in one of the ChecklistBank datasets. If you use the API externally we will just return a NoMatch. We could change this for authorized parties if useful.
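
Purely to illustrate the kind of loose epithet matching described above, here is a rough Python sketch; it is not the ChecklistBank implementation, just an approximation of the listed rules (silent h, i/y, double letters, gender suffix):

```python
import re

def normalise_epithet(epithet: str) -> str:
    """Rough approximation of the loose epithet matching described above.
    Not the real ChecklistBank algorithm, just the listed ideas:
    silent h, i/y confusion, doubled letters, gender suffixes."""
    e = epithet.lower()
    e = re.sub(r"(?<=[aeiou])h(?=[aeiou])", "", e)  # drop a silent h between vowels
    e = e.replace("y", "i")                         # treat i and y alike
    e = re.sub(r"(.)\1", r"\1", e)                  # collapse doubled letters
    e = re.sub(r"(us|a|um)$", "", e)                # strip common gender suffixes
    return e

# Two spellings that fall together under these rules:
print(normalise_epithet("sylvatica"), normalise_epithet("silvaticum"))
```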

NameIndexIds are meant to be internal rather than stable identifiers. As we improve matching rules we need to recluster all names and rebuild the names index, assigning new ids and rematching all datasets. This won't happen too often, but it is needed for improving our algorithms.

sckott commented 3 years ago

Thanks! That's helpful. This will be good info to have when describing how the route works.

Sorry if this veers a bit off topic 😬 ... I'm trying to implement COL+ in the taxize R package, and /name/matching probably isn't a great fit for my use case there since you already have to have a correct name. If I want users to be able to search COL+ (using /nameusage/search) against a single dataset (so they aren't dealing with similar results from many datasets for one name search without knowing which to use), what dataset would that be? 3? 3LR? Some other dataset? Is 3 similar to the GBIF Backbone?

mdoering commented 3 years ago

COL (we don't use COL+ anymore, as it referred to the now finished project) is managed in dataset 3, which is the working draft. From there we issue roughly monthly releases, which become datasets of their own. We keep them for some time, but eventually we will delete all but one per year, which gets "long term support". As users often want access to the latest release of COL without knowing its datasetKey, we offer the magic 3LR key, which always redirects you to the datasetKey of the latest release. So I guess 3LR is what you should use if people want to access COL. It is the key we use on the new COL portal to query the API: http://www.dev.catalogue.life/
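
As a sketch of that suggestion, a Python call against the latest release via the 3LR key; the /nameusage/search route is the one mentioned above, while the query parameters (`q`, `limit`) and the response shape are assumptions:

```python
# Sketch: search the latest COL release via the magic 3LR key.
# The "q"/"limit" parameters and the paged "result" structure are assumed,
# not taken from API documentation.
import requests

resp = requests.get(
    "http://api.catalogue.life/dataset/3LR/nameusage/search",
    params={"q": "Puma concolor", "limit": 5},
    timeout=30,
)
resp.raise_for_status()

for hit in resp.json().get("result", []):
    # The exact shape of a search hit is assumed; inspect the JSON to confirm.
    print(hit)
```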

mdoering commented 3 years ago

If users want to work against a different dataset, it might be good to allow them to pick their own. That way you can work with any of the datasets imported into ChecklistBank: https://data.catalogue.life/dataset
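
One possible way to offer that choice is to list the datasets from the API first and let the caller pick a datasetKey; the /dataset listing route on api.catalogue.life and the `key`/`title` fields are assumptions, made by analogy with the URLs mentioned in this thread:

```python
# Sketch: fetch the ChecklistBank dataset list so a user can pick a datasetKey.
# The /dataset API route, the paged "result" structure and the "key"/"title"
# fields are assumptions based on the URLs mentioned in this thread.
import requests

resp = requests.get(
    "http://api.catalogue.life/dataset",
    params={"limit": 10},
    timeout=30,
)
resp.raise_for_status()

for ds in resp.json().get("result", []):
    print(ds.get("key"), "-", ds.get("title"))
```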

sckott commented 3 years ago

Thanks, very helpful.