Closed ManonGros closed 4 years ago
To get an idea of how this would impact linking GBIF specimens to collections and institutions, I checked a few botanical collections on GBIF. Here are these collections and the codes they use:
It seems like the tendency is to use mostly the same code for institution and collection (or skip one of them).
Just to reiterate option 2: assuming the codes match IH and are unique in GrSciColl, that would mean:
The important part is below, I guess:
The institution code and collection code can be the same unless specified otherwise.
IH fields:
- Name: Inst + Coll
- Herbarium Code: Inst + Coll when creating a new entry, AND create a GitHub issue if the code is different when updating entries (skip the update in that case)
- Current Status: Inst + Coll
- Correspondents: see staff
- Contact: Inst + Coll
- Address: Inst + Coll
- Coordinates: Inst
- URL: Inst + Coll
- Taxonomic Coverage: Coll
- Geography: Coll
- Notes: Coll
- Number of Specimens: Coll
- Date Founded: Inst
- Incorporated Herbaria: Coll (create a field for it, "Incorporated collection")
- Important Collectors: leave it out for now, but in the future Coll with maybe a new field?
- TABLE of Specimens/collection: leave it out for now, but in the future Coll with maybe a new field?
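To make the field mapping above easier to discuss, here is a minimal Python sketch of how an IH entry could be split into an institution and a collection. It is only an illustration, not the registry's actual sync code: the snake_case field names, `IH_FIELD_TARGETS`, `split_ih_record`, and `resolve_code` are all hypothetical names invented for this example. Correspondents (handled as staff), Important Collectors, and the specimens table are deliberately left out, as in the list.

```python
# Hypothetical sketch: none of these names come from the GBIF registry code base.
INSTITUTION = "institution"
COLLECTION = "collection"
BOTH = (INSTITUTION, COLLECTION)

# IH field -> GrSciColl entities that would receive it, following the list above.
# Field names are assumed snake_case versions of the IH labels.
IH_FIELD_TARGETS = {
    "name": BOTH,
    "current_status": BOTH,
    "contact": BOTH,
    "address": BOTH,
    "coordinates": (INSTITUTION,),
    "url": BOTH,
    "taxonomic_coverage": (COLLECTION,),
    "geography": (COLLECTION,),
    "notes": (COLLECTION,),
    "number_of_specimens": (COLLECTION,),
    "date_founded": (INSTITUTION,),
    "incorporated_herbaria": (COLLECTION,),
}


def split_ih_record(ih_record):
    """Split one IH entry (a dict) into institution and collection dicts."""
    institution, collection = {}, {}
    for field, targets in IH_FIELD_TARGETS.items():
        if field not in ih_record:
            continue
        if INSTITUTION in targets:
            institution[field] = ih_record[field]
        if COLLECTION in targets:
            collection[field] = ih_record[field]
    return institution, collection


def resolve_code(ih_code, existing_code=None):
    """Apply the Herbarium Code rule.

    Returns (code_to_store, apply_update, open_github_issue).
    """
    if existing_code is None:
        # New entry: the IH code is used for both institution and collection.
        return ih_code, True, False
    if existing_code != ih_code:
        # Code changed on update: keep the old code, skip the update, flag it.
        return existing_code, False, True
    return existing_code, True, False
```

For example, `resolve_code("NY", "NYBG")` would keep the existing code, skip the update, and signal that a GitHub issue should be created.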
Keep the identifiers where they are.
When will we decide to use DOIs?
The following fields will be added to the GrSciColl Collection entity in order to map some of the fields from IH:
@asturcon what are the types of these fields? Single-line strings, text blocks, markdown, numbers, UUIDs?
The same institution appears multiple times in IH and hence in GrSciColl. E.g. http://sweetgum.nybg.org/science/ih/herbarium-details/?irn=126771 http://sweetgum.nybg.org/science/ih/herbarium-details/?irn=126772
This seems to be a case of IH not splitting institutions and collections, and hence having to create 2 entities with the same information simply to have 2 codes for the 2 collections of the institution.
Now that we have decided (in agreement with IH) to always create an implicit collection, we can arguably delete one of the institutions. After syncing with IH and before adding/syncing more data (iDigBio), perhaps we should run a deduplication on the institutions in GrSciColl? It is also possible to do this at a later stage; we might just need to merge more data at that point.
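As a starting point for that deduplication, here is a rough Python sketch that groups institutions whose name and address are identical after normalising case and whitespace. The record shape (`key`, `name`, `address`) and the function name are assumptions made for this example, not the GrSciColl data model, and exact matching on name and address is only one possible criterion; real duplicates may need fuzzier matching and manual review.

```python
from collections import defaultdict


def _normalise(text):
    # Lower-case and collapse whitespace so trivial differences do not hide duplicates.
    return " ".join(text.lower().split())


def find_duplicate_institutions(institutions):
    """Return lists of institution keys that share the same name and address.

    `institutions` is an iterable of dicts with hypothetical fields
    "key", "name" and (optionally) "address".
    """
    groups = defaultdict(list)
    for inst in institutions:
        signature = (_normalise(inst["name"]), _normalise(inst.get("address", "")))
        groups[signature].append(inst["key"])
    # Only groups with more than one member are duplicate candidates.
    return [keys for keys in groups.values() if len(keys) > 1]
```

The output is a list of candidate groups to review, not an automatic merge; which entity to keep would still have to be decided per case.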
In production and scheduled to run weekly.
Before we start
These are my assumptions about the GrSciColl registry:
Option 1: Always Map IH to Institutions
Right now, entries in IH describe mostly institutions. In the context of IH, it makes sense since we are talking about herbaria only. The problem is that GrSciColl is a broader context where the herbaria/botany part of an institution cannot always represent an institution.
Example of resulting issues
Let's take an example that illustrates the problem: UWO
Another example would be ANSP, which also has an arthropod collection but is described as a diatom herbarium in GrSciColl and in IH.
Another type of problem is conflicting information. We have some cases where the description of an institution on GrSciColl is more generic. For example, in the case of LUX, the information was rearranged on GrSciColl:
Possible solutions
With IH mapped to institutions in GrSciColl, we have two possible solutions:
Option 2: Map IH entries to collections
Conceptually, it would make more sense for herbaria to be collections in GrSciColl. In a way, they are "botany collections". By this, I mean that each IH entry should be a collection attached to an institution. More ideas on how this could work below.
Advantages
Overall, I think it could make GrSciColl more coherent:
How this could work
These are just some ideas to be discussed. Here is what we could try to achieve:
I tried to illustrate this with the ANSP example:
Obviously, this would be far from perfect, but this makes more sense to me than mapping everything to institutions. Any thoughts on this? Did I forget anything?
Issue related: https://github.com/gbif/registry/issues/159