mdoering opened 4 months ago
Species suggest and search both have similar parameters and return types. The exact behavior of the search (scoring/ranking) is likely to be different:
datasetKey
UUID needs to be mapped to CLBs int datasetKeyconstituentKey
UUID needs to be mapped to CLBs int datasetKey -> sourceDatasetKeyrank
OKhigherTaxonKey
OK, but will be a stringstatus
OKextinct
OKhabitat
OK = environmentthreat
: MISSING ! but could be addednameType
OKnomenclaturalStatus
: very different vocabulary is being used in CLB. I would think this is a very niche parameter that would not be a blockerorigin
: OK, but slightly different vocab values. Not all can be mappedissue
OK, but rather different vocab values. Not all can be mappedhl
: highlighting is not yet supported in CLB (and troublesome to implement)limit
/offset
: OKfacet
: OK (might be some other facet names we could map - and available facets also differ)facetMincount
: NOT SUPPORTEDfacetMultiselect
: NOT SUPPORTEDfacetLimit
: NOT SUPPORTEDfacetOffset
: NOT SUPPORTEDReturn type
no Linnean ranks
, but could be added and is desireable as users have already requested it: https://github.com/CatalogueOfLife/backend/issues/1122numDescendants
: NOT SUPPORTED, but could be for immutable datasetsnumOccurrences
: NOT SUPPORTED, I wonder if that is even still in use in GBIF? We could add this by calling the GBIF API to retrieve countsdescriptions
: NOT SUPPORTED, but there is a generic TaxonProperty extension that maybe could be used instead. Or a new extension being added which isn't such a big thing.vernacularNames
: all OK, but some properties are missing and would need to be added:
lifeStage
plural
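To make the mapping above concrete, here is a minimal sketch of a v1 -> ChecklistBank parameter shim. The CLB-side names follow the list above where given (`sourceDatasetKey`, `environment`); everything else, including the `translate` helper itself, is a hypothetical illustration, not an existing API.

```python
# Hypothetical v1 -> CLB search parameter shim, following the mapping above.
V1_TO_CLB = {
    "datasetKey": "datasetKey",        # UUID must first be resolved to CLB's int key
    "constituentKey": "sourceDatasetKey",
    "rank": "rank",
    "status": "status",
    "habitat": "environment",
    "nameType": "nameType",
}

# Highlighting and facet tuning have no CLB equivalent per the list above.
UNSUPPORTED = {"hl", "facetMincount", "facetMultiselect", "facetLimit", "facetOffset"}

def translate(params: dict) -> tuple:
    """Rename supported v1 params; collect unsupported ones for a warning."""
    out, dropped = {}, []
    for name, value in params.items():
        if name in UNSUPPORTED:
            dropped.append(name)
        else:
            out[V1_TO_CLB.get(name, name)] = value
    return out, dropped
```

Dropping unsupported parameters (rather than failing) would let old v1 clients keep working, at the cost of silently ignored facet options.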
Species response type: see above. Additionally:
- `deleted`: CLB releases are immutable and deleted identifiers work differently. We can resolve older, now deleted IDs, but searching & working across them all is difficult and maybe not possible
- `lastCrawled`: OK (but can also be uploads)
- `lastInterpreted`: OK, but really always the same as crawled

v1 methods which do not exist at all:
- `/species/{usageKey}/toc`
- `/species/{usageKey}/speciesProfiles`: we only keep a few infos directly on the taxon, as most of these infos are 1:1 and make no sense in an extension. DwC forced us that way. E.g. `extinct`, `environment` and `livingPeriod` exist, but `lifeForm`, `habitat`, `ageInDays`, `sizeInMillimeter` and `massInGram` do not and would have to be TaxonProperty records. Doable, but quite some mapping effort
- `/species/{usageKey}/metrics`: does not exist at all. Would need to be precalculated and stored similar to the flat classification

Identifiers are the biggest problem. ChecklistBank has compound keys made of a `datasetKey` (int) and a dataset-scoped `id` (String), which is the original identifier from the source, while v1 has a single int key that is unique across all datasets.
COL stable identifiers are short strings and can be converted bidirectionally into an int. That won't work for other datasets' identifiers.
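To illustrate why COL stable ids can round-trip to ints, here is a sketch of a bidirectional int <-> short-string codec. The base-36 alphabet is an assumption for illustration; ChecklistBank's actual id converter may use a different character set.

```python
# Sketch of a bidirectional int <-> short-string id codec.
# The base-36 alphabet here is an assumption, not CLB's real one.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n: int) -> str:
    """Encode a non-negative int as a short base-36 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, r = divmod(n, len(ALPHABET))
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))

def decode(s: str) -> int:
    """Decode a base-36 string back to the original int."""
    n = 0
    for ch in s:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n
```

Such a codec only works because COL stable ids come from one controlled alphabet; arbitrary source-dataset identifiers (free-form strings) have no such bijection to ints, which is exactly the problem described above.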
Backbone taxon keys are used in other GBIF APIs:
I can't think of any exposure of non-backbone keys; things like the IUCN Red List resolution during interpretation don't store the keys.
Does that mean we cannot change the keys without breaking the other APIs, or is it just a matter of (not) changing the data type from int to string? If the APIs accepted both an old backbone integer and a new string id, we might be able to offer a smooth transition: old integers would be mapped internally to the new ids, which could then also be submitted directly.
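The transition idea could be sketched as a resolver that accepts both key styles. The lookup table and function name here are hypothetical stand-ins for whatever internal mapping service would exist; the example mapping values are made up.

```python
# Hypothetical resolver for the dual-key transition described above.
# Maps old v1 int keys to new string ids; values here are invented examples.
LEGACY_TO_NEW = {2435098: "6MB3T"}

def resolve_usage_key(raw: str) -> str:
    """Return the new string id for either an old int key or a new-style id.

    Note: a new-style id consisting only of digits would be ambiguous,
    so the new id scheme would need to avoid purely numeric strings.
    """
    if raw.isdigit():  # old v1 integer key
        key = int(raw)
        if key not in LEGACY_TO_NEW:
            raise ValueError(f"unknown legacy key {key}")
        return LEGACY_TO_NEW[key]
    return raw  # already a new-style id
```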
Note also that there are 17 accepted kingdoms in COL these days, mostly viruses.
What are the differences between the v1 GBIF Species API and ChecklistBank's data model? Are there any true blockers?