gbif / registry

GBIF Registry
Apache License 2.0

Synchronize with Index Herbariorum - Collections and institutions #167

Closed ManonGros closed 4 years ago

ManonGros commented 4 years ago

Before we start

These are my assumptions about the GrSciColl registry:

  1. We want to use GrSciColl to link the institutions and collections to the specimens available on GBIF (mainly via the institutions and collections codes).
  2. We want to avoid repeating/duplicating the efforts of other registries. Since some of the data is already maintained by IH, we want to use IH to maintain the information as much as possible.
  3. We want entries in GrSciColl to be stable, with clear identifiers that we can promote and that citations can link to.

Option 1: Always Map IH to Institutions

Right now, entries in IH mostly describe institutions. In the context of IH this makes sense, since it deals with herbaria only. The problem is that GrSciColl covers a broader context, in which the herbarium/botany part of an institution cannot always represent the whole institution.

Example of resulting issues

Let's take an example that illustrates the problem: UWO

Another example would be ANSP, which also has an arthropod collection but is described as a diatom herbarium in GrSciColl and in IH.

Another type of problem is conflicting information. We have some cases where the description of an institution on GrSciColl is more generic. For example, in the case of LUX, the information was rearranged on GrSciColl:

Possible solutions

With IH mapped to institutions in GrSciColl, we have two possible solutions:

Option 2: Map IH entries to collections

Conceptually, it would make more sense for herbaria to be collections in GrSciColl. In a way, they are "botany collections". By this, I mean that each IH entry should be a collection attached to an institution. More ideas on how this could work below.

Advantages

Overall, I think it could make GrSciColl more coherent:

How this could work

These are just some ideas to be discussed. Here is what we could try to achieve:

  1. Each IH entry will make a collection attached to an institution in GrSciColl.
  2. If the GrSciColl institution doesn't exist, create one from the information available in IH (name, code, address, etc.; everything but taxonomic coverage and other collection-specific info). Some info, such as the address, might then be the same for the collection and the institution, but that is OK.
  3. The institution code and collection code can be the same unless specified otherwise.
  4. When synchronising, unless specified otherwise (with a tag or a checkbox?), the info from IH can update both collection and institution. Otherwise, only the collection is updated.
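The four points above could be sketched roughly as follows. This is only an illustration of the idea, not the registry's implementation: the `Institution` and `Collection` classes, the `ih_sync` flag (standing in for the "tag or checkbox" mentioned in point 4) and the `sync_ih_entry` function are all hypothetical names, and the IH entry is assumed to be a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Institution:
    code: str
    name: str
    address: str = ""
    # Hypothetical flag for point 4: False means "IH must not overwrite this institution".
    ih_sync: bool = True


@dataclass
class Collection:
    code: str
    name: str
    institution_code: str
    address: str = ""
    taxonomic_coverage: str = ""


def sync_ih_entry(
    entry: dict,
    institutions: Dict[str, Institution],
    collections: Dict[str, Collection],
) -> Collection:
    """Create or refresh the collection for one IH entry (assumed dict layout)."""
    code = entry["code"]
    # Point 3: institution and collection codes are the same unless specified otherwise.
    inst_code = entry.get("institution_code", code)

    inst = institutions.get(inst_code)
    if inst is None:
        # Point 2: no institution yet, so create one from the generic IH info only.
        inst = Institution(inst_code, entry["name"], entry.get("address", ""))
        institutions[inst_code] = inst
    elif inst.ih_sync:
        # Point 4: IH may update the institution unless it is flagged otherwise.
        inst.name = entry["name"]
        inst.address = entry.get("address", "")

    # Point 1: every IH entry becomes a collection attached to an institution.
    coll = Collection(
        code=code,
        name=entry["name"],
        institution_code=inst.code,
        address=entry.get("address", ""),
        taxonomic_coverage=entry.get("taxonomic_coverage", ""),
    )
    collections[code] = coll
    return coll
```

With an institution flagged `ih_sync=False`, only the collection would change on synchronisation, which is the behaviour an entry like ANSP would presumably need.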

I tried to illustrate this with the ANSP example: idea_IH_synch 001

Obviously, this would be far from perfect, but this makes more sense to me than mapping everything to institutions. Any thoughts on this? Did I forget anything?

Issue related: https://github.com/gbif/registry/issues/159

ManonGros commented 4 years ago

To have an idea on how this would impact linking GBIF specimens to collections and institutions, I checked a few botanical collections on GBIF. Here are these collections and the codes they use:

It seems like the tendency is to use mostly the same code for institution and collection (or skip one of them).

MortenHofft commented 4 years ago

Just to reiterate option 2: assuming the codes match IH and are unique in GrSciColl, that would mean:

The important part is below, I guess:

The institution code and collection code can be the same unless specified otherwise.

ManonGros commented 4 years ago

Logic for creating entities

  1. update GrSciColl entities only if we have IRN in the identifier (otherwise skip). It can mean that IH can update both an institution and a collection.
  2. if there are several matches of the same kind, create GitHub issue.
  3. for an IH entity, if no collection in GrSciColl but institution in GrSciColl, create just collection attached to existing institution + update institution.
  4. for an IH entity, if no collection in GrSciColl and no institution in GrSciColl, create institution (see details of fields below) and create collection + put IRN for both.
  5. for staff members, link or unlink staff members for institutions and collections that have IRN.
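As a rough sketch of rules 1 to 4 (staff linking is left out), the decision could look like the function below. The function name `plan_sync`, the action strings and the `identifiers` key are assumptions made for illustration. The case "collection has the IRN but no institution does" is not covered by the rules above, so treating it as "update the collection only" is also an assumption.

```python
from typing import List


def plan_sync(irn: str, institutions: List[dict], collections: List[dict]) -> List[str]:
    """Return the actions to take for one IH entity, identified by its IRN."""
    # Rule 1: only entities carrying the IRN as an identifier count as matches.
    inst_matches = [i for i in institutions if irn in i.get("identifiers", [])]
    coll_matches = [c for c in collections if irn in c.get("identifiers", [])]

    # Rule 2: several matches of the same kind are ambiguous, so ask a human.
    if len(inst_matches) > 1 or len(coll_matches) > 1:
        return ["create_github_issue"]

    if inst_matches and coll_matches:
        # Rule 1: IH can update both an institution and a collection.
        return ["update_institution", "update_collection"]
    if coll_matches:
        # Assumption: not spelled out in the rules above.
        return ["update_collection"]
    if inst_matches:
        # Rule 3: attach a new collection to the existing institution and update it.
        return ["update_institution", "create_collection"]
    # Rule 4: nothing exists yet, so create both and put the IRN on both.
    return ["create_institution", "create_collection"]
```

Returning a list of action names rather than performing the changes keeps the matching logic easy to test separately from the registry calls.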

Details of fields to update

IH fields:

  - Name: Inst + Coll
  - Herbarium Code: Inst + Coll when creating new entry AND create GitHub issue if code different when updating entries (skip update in that case)
  - Current Status: Inst + Coll
  - Correspondents: see staff
  - Contact: Inst + Coll
  - Address: Inst + Coll
  - Coordinates: Inst
  - URL: Inst + Coll
  - Taxonomic Coverage: Coll
  - Geography: Coll
  - Notes: Coll
  - Number of Specimens: Coll
  - Date Founded: Inst
  - Incorporated Herbaria: Coll (create a field for it "Incorporated collection")
  - Important Collectors: leave it out for now but in the future Coll with maybe new field?
  - TABLE of Specimens/collection: leave it out for now but in the future Coll with maybe new field?
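One way to picture this mapping is as a lookup table from IH field to target entities, as in the sketch below. The names `FIELD_TARGETS`, `split_ih_record` and `herbarium_code_action` are hypothetical, the IH record is assumed to be a dictionary keyed by the field labels used above, and the fields marked "see staff" or "leave it out for now" are deliberately omitted.

```python
INST = "institution"
COLL = "collection"

# IH field label -> entities that receive the value (taken from the list above).
FIELD_TARGETS = {
    "Name": (INST, COLL),
    "Current Status": (INST, COLL),
    "Contact": (INST, COLL),
    "Address": (INST, COLL),
    "Coordinates": (INST,),
    "URL": (INST, COLL),
    "Taxonomic Coverage": (COLL,),
    "Geography": (COLL,),
    "Notes": (COLL,),
    "Number of Specimens": (COLL,),
    "Date Founded": (INST,),
    "Incorporated Herbaria": (COLL,),
}


def split_ih_record(record: dict) -> dict:
    """Split one IH record into the values destined for each entity."""
    out = {INST: {}, COLL: {}}
    for ih_field, value in record.items():
        # Unmapped fields (e.g. Important Collectors) are silently skipped.
        for target in FIELD_TARGETS.get(ih_field, ()):
            out[target][ih_field] = value
    return out


def herbarium_code_action(existing_code: str, ih_code: str) -> str:
    """On update, a differing code means: open a GitHub issue and skip the update."""
    if existing_code == ih_code:
        return "update"
    return "create_github_issue_and_skip"
```

The Herbarium Code is handled by its own function because, unlike the other fields, it is only written when a new entry is created.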

Identifiers

Keep the identifiers where they are.

When will we decide to use DOIs??

marcos-lg commented 4 years ago

The following fields will be added to the GrSciColl Collection entity in order to map some of the fields from IH:

MortenHofft commented 4 years ago

@asturcon what are the types for these fields? single line strings, text blocks, markdown, numbers, uuids?

marcos-lg commented 4 years ago
MortenHofft commented 4 years ago

The same institution appears multiple times in IH and hence in GrSciColl. E.g. http://sweetgum.nybg.org/science/ih/herbarium-details/?irn=126771 http://sweetgum.nybg.org/science/ih/herbarium-details/?irn=126772

This seems to be a case of IH not splitting institutions and collections, and hence having to create 2 entities with the same information, simply to have 2 codes for the 2 collections from the institution.

Now that we have decided (in agreement with IH) to always create an implicit collection, we can arguably delete one of the institutions. After syncing with IH and before adding/syncing more data (iDigBio), perhaps we should run a deduplication on the institutions in GrSciColl? It is also possible to do this at a later stage; we might just need to merge more data at that point.
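A first pass at finding such duplicates could be as simple as grouping institutions that share the same name and address. The sketch below assumes institutions are dictionaries with `code`, `name` and `address` keys; both that layout and the choice of name plus address as the matching key are assumptions, not something decided in this thread.

```python
from collections import defaultdict
from typing import Dict, List


def _normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def duplicate_candidates(institutions: List[dict]) -> List[List[str]]:
    """Return groups of institution codes that share the same name and address."""
    groups: Dict[tuple, List[str]] = defaultdict(list)
    for inst in institutions:
        key = (_normalise(inst["name"]), _normalise(inst.get("address", "")))
        groups[key].append(inst["code"])
    # Only groups with more than one member are candidates for merging.
    return [codes for codes in groups.values() if len(codes) > 1]
```

The output is only a list of candidates; deciding which institution to keep and how to merge the data would still need a human or a separate rule.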

marcos-lg commented 4 years ago

In production and scheduled to run weekly.