gbif / checklistbank

GBIF Checklist Bank

name, name_usage and citation data fields from GBIF API? #268

Open csbrown opened 1 year ago

csbrown commented 1 year ago

Feature Request:

The GBIF data model is available here. It would be very helpful if we could use the API to reconstruct the fields in the name, name_usage and citation tables. It seems some of the fields have been renamed in the API (such as name_usage.status being called "taxonomicStatus", I think), and for others it is not entirely clear how to access them at all (such as the name_usage.pp_synonym_fk field).

Is there a data dictionary that makes the relationship between the API fields and the database fields clear (and, as a bonus, where applicable, the Darwin Core fields)? Also, which API endpoints would I need to use to reconstruct the name, name_usage and citation fields for a given taxon?
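
To make it concrete, here is a rough sketch of the kind of mapping I am after, pulling one usage from the species API and guessing at the corresponding table columns (the endpoints are real, but the column correspondences and the key are only my guesses/placeholders, not an official mapping):

```python
import json
import urllib.request

API = "https://api.gbif.org/v1"

def fetch_json(path):
    """GET a GBIF API path and decode the JSON response."""
    with urllib.request.urlopen(f"{API}{path}") as resp:
        return json.load(resp)

key = 5219404  # placeholder backbone usage key

usage = fetch_json(f"/species/{key}")        # roughly name_usage territory
parsed = fetch_json(f"/species/{key}/name")  # parsed-name details, roughly name territory

# Guessed correspondences -- these would need to be verified against the actual DDL:
name_usage_like = {
    "id": usage.get("key"),
    "status": usage.get("taxonomicStatus"),  # renamed in the API
    "rank": usage.get("rank"),
    "parent_fk": usage.get("parentKey"),
    "is_synonym": usage.get("synonym"),
}
name_like = {
    "scientific_name": parsed.get("scientificName"),
    "canonical_name": parsed.get("canonicalName"),
    "authorship": parsed.get("authorship"),
}
print(json.dumps({"name_usage": name_usage_like, "name": name_like}, indent=2))
```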

Use Case:

We are working on a project for invasive species management. Part of this involves being able to reference taxa as unambiguously as possible, identify synonymous (e.g. vernacular) names for taxa, etc. GBIF looks like one of the best datasets for taxonomic nomenclature around, and rather than re-invent the wheel, we would very much like to lean on GBIF records as a sort of "taxonomic standard".

Since we'll be querying taxonomic names frequently, it makes sense for us to mirror at least the names data locally, so that we can query a local database efficiently, save users' frequently-referenced taxa, etc. OTOH, since "invasive" species are generally a very small subset of all species, we would likely only need to mirror a small portion of the entire GBIF dataset, while still maintaining the ability to pull in additional data as needed. The current thought is to use the data model that you all so graciously provide, but, when species are missing from our database, to allow users to search for species using the GBIF API, which we will then cache in our mirror.

csbrown commented 1 year ago

As an aside, if this seems like a bad plan or if there is a better one, I would be grateful for literally any advice. It's not immediately clear to me how best to use the GBIF species search to let users pick a species that they are currently documenting, and then store that info. E.g., biologists find that Panthera leo has taken up residence in Christiania, and they want to document this by referencing "the most canonical" record (or set of records?) in GBIF for Panthera leo. How might they go about finding this information from the API and keeping track of it for matching records in the future?
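
For instance, is resolving the name through the backbone match endpoint and keeping the returned usageKey the right approach? Something along these lines (just a sketch of what I have in mind; the fallback behaviour is only illustrative):

```python
import json
import urllib.parse
import urllib.request

def match_backbone(name):
    """Resolve a scientific name string against the GBIF backbone."""
    qs = urllib.parse.urlencode({"name": name})
    with urllib.request.urlopen(f"https://api.gbif.org/v1/species/match?{qs}") as resp:
        return json.load(resp)

result = match_backbone("Panthera leo")
if result.get("matchType") != "NONE":
    # usageKey is the backbone (nub) key -- a stable identifier we could store
    # locally and use later to re-fetch /v1/species/{usageKey}.
    print(result["usageKey"], result["scientificName"], result["status"])
else:
    print("No backbone match; fall back to /v1/species/search and let the user pick.")
```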

mdoering commented 1 year ago

The mapping of database fields to the Java classes is done in the MyBatis SQL mapper layer. There are not many write methods, so you cannot easily go back from a NameUsage instance to relational db records.

If you just want to store GBIF taxa, consider using the JSON directly and storing that. Or even consider the very simple database schema which we supply for the backbone? You can quickly load all backbone releases into that schema.
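
Storing the JSON could be as simple as something like this (purely illustrative; SQLite and the table layout are just stand-ins to show the idea of keeping the raw API payload per usage key):

```python
import json
import sqlite3
import urllib.request

# Illustrative local cache: one row per backbone usage, raw API JSON kept verbatim.
conn = sqlite3.connect("gbif_cache.db")
conn.execute(
    """CREATE TABLE IF NOT EXISTS name_usage_cache (
           usage_key INTEGER PRIMARY KEY,  -- GBIF backbone key (nubKey)
           payload   TEXT NOT NULL         -- raw JSON from /v1/species/{key}
       )"""
)

def cache_usage(key):
    """Fetch a usage from the GBIF species API and store the raw JSON locally."""
    with urllib.request.urlopen(f"https://api.gbif.org/v1/species/{key}") as resp:
        payload = resp.read().decode("utf-8")
    conn.execute(
        "INSERT OR REPLACE INTO name_usage_cache (usage_key, payload) VALUES (?, ?)",
        (key, payload),
    )
    conn.commit()

def get_usage(key):
    """Return the cached usage, fetching and caching it on a miss."""
    row = conn.execute(
        "SELECT payload FROM name_usage_cache WHERE usage_key = ?", (key,)
    ).fetchone()
    if row is None:
        cache_usage(key)
        row = conn.execute(
            "SELECT payload FROM name_usage_cache WHERE usage_key = ?", (key,)
        ).fetchone()
    return json.loads(row[0])
```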

csbrown commented 1 year ago

Is there a tool to load the DWCA files into the pg backbone schema?

mdoering commented 1 year ago

We provide special "simple" dumps that you can load with psql as described here: https://hosted-datasets.gbif.org/datasets/backbone/README.html
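
If you want to script the load instead of running psql by hand, it is roughly this. The dump URL, table name and COPY options below are placeholders; take the real file names and DDL from the README above:

```python
import gzip
import urllib.request

import psycopg2  # assumption: a local Postgres with the DDL from the README already applied

# Placeholder URL and table name -- check the README for the actual dump files.
DUMP_URL = "https://hosted-datasets.gbif.org/datasets/backbone/current/simple.txt.gz"

conn = psycopg2.connect("dbname=checklistbank")
with conn, conn.cursor() as cur, urllib.request.urlopen(DUMP_URL) as resp:
    with gzip.open(resp, mode="rt", encoding="utf-8") as tsv:
        # Stream the gzipped TSV dump straight into the pre-created backbone table.
        cur.copy_expert("COPY backbone FROM STDIN WITH (FORMAT text)", tsv)
```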

csbrown commented 1 year ago

Cool, I think I can find my way from here. The very simple database schema is a bit too simple for our needs, as we need the nubKey for cross-referencing with other databases, but it looks like the single-table design might suit our purposes, and we can customize it to work. Thanks for your advice.