CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

Metadata model update #989

Closed: mdoering closed this issue 3 years ago

mdoering commented 3 years ago

The current ChecklistBank (CLB) dataset metadata model evolved from the earlier, fully flat ACEF model. It separates authors from editors and parses each into a list of simple Person instances (givenName, familyName, email, orcid). It does the same for the ACEF organisations, which become a list of Organisation instances (department, name, city, state, country).
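As a rough illustration of the shape described above, here is a minimal Python sketch. The field names follow the lists in the text; the `DatasetMetadata` container, its `title` field and all example values are assumptions made for illustration and are not taken from the CLB code base.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    # Fields as listed in the issue text; all optional in this sketch.
    given_name: Optional[str] = None
    family_name: Optional[str] = None
    email: Optional[str] = None
    orcid: Optional[str] = None


@dataclass
class Organisation:
    department: Optional[str] = None
    name: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None


@dataclass
class DatasetMetadata:
    # Hypothetical container: only the three lists are described in the text.
    title: str
    authors: List[Person] = field(default_factory=list)
    editors: List[Person] = field(default_factory=list)
    organisations: List[Organisation] = field(default_factory=list)


# Made-up example values, for illustration only.
example = DatasetMetadata(
    title="Example checklist",
    authors=[Person(given_name="Ada", family_name="Example", orcid="0000-0000-0000-0000")],
    editors=[Person(given_name="Ben", family_name="Sample", email="ben@example.org")],
    organisations=[Organisation(name="Example Institute", city="Copenhagen", country="DK")],
)
```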

We need to be able to convert metadata between EML, the GBIF JSON API, DataCite (DOI), Plazi and the ColDP metadata.yaml, which we can still change to match our internal model closely. It would be good to have a finalised metadata model before we do an annual release this summer.
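To make one of these conversions concrete, the sketch below maps a simple person record to a DataCite-style creator entry. The output keys (`name`, `nameType`, `givenName`, `familyName`, `nameIdentifiers`) follow my reading of the DataCite JSON representation and should be checked against the schema version in use. The function name and the input keys are hypothetical. Note that `email` is not carried over in this sketch, which is a small example of how such conversions can lose information.

```python
from typing import Any, Dict


def person_to_datacite_creator(person: Dict[str, str]) -> Dict[str, Any]:
    """Map a simple person dict to a DataCite-style creator dict (sketch)."""
    given = person.get("givenName")
    family = person.get("familyName")
    creator: Dict[str, Any] = {
        # "Family, Given" ordering; skips whichever part is missing.
        "name": ", ".join(part for part in (family, given) if part),
        "nameType": "Personal",
    }
    if given:
        creator["givenName"] = given
    if family:
        creator["familyName"] = family
    orcid = person.get("orcid")
    if orcid:
        creator["nameIdentifiers"] = [
            {
                "nameIdentifier": f"https://orcid.org/{orcid}",
                "nameIdentifierScheme": "ORCID",
                "schemeUri": "https://orcid.org",
            }
        ]
    return creator


# Made-up input values, for illustration only.
creator = person_to_datacite_creator(
    {
        "givenName": "Ada",
        "familyName": "Example",
        "email": "ada@example.org",
        "orcid": "0000-0000-0000-0000",
    }
)
```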

There are a couple of obstacles or possible improvements to consider:

mdoering commented 3 years ago

See also #972 and the metadata alignment at https://docs.google.com/spreadsheets/d/1-zO5zHqul6VBwcCcYh8egMI2N6q6LUB7uCU247o727c/edit?ts=605b88c7#gid=352122413

mdoering commented 3 years ago

Should we just implement the GBIF EML profile, which is a defined subset of EML 2.1.1? This subset not only leaves out areas of EML, but also reduces the cardinality of quite a few properties. Some things, like keywords, links to a homepage or resource, or the license, are also unnecessarily complex. EML is a full-fledged metadata standard that is great when metadata is your main concern. But it is easy to get wrong, and round trips between implementations are likely to be lossy.

The subset selected for the GBIF profile contains some collection- and occurrence-oriented parts that COL would not need. GBIF also takes a hybrid approach: some parts are stored in the registry and are readily available through the GBIF JSON API, while other parts of the EML are only available in the XML document, which is stored and accessible via the API. The original idea behind the GBIF registry was to support and store various metadata file formats, e.g. ISO 19139 or the FGDC Biological Data Profile, and to keep only the core fields in a database. See IPT-Metadata-Profile-Additions_v3.doc from a year ago. So far, EML has clearly dominated in GBIF.

In general I am concerned about increased complexity and thus the increased resources needed to manage metadata. COL actively manages most of the metadata that matters to COL. This is done for good reasons: providers are often sloppy with metadata, and citations and author listings are not very consistent in their format. Here a more atomised model would help to provide consistent formatting. I therefore think it is best to evolve the current model and keep it only as complex as needed. Guidance by use cases is good. I would rather extend the model in the future when we need to than start out with a very complex model that gives us trouble from the start.
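The point about atomised names can be sketched as follows: once given and family names are stored separately, one function can render every author list the same way, however the provider originally wrote it. The citation style used here (family name, initials, "&" before the last author) is an arbitrary assumption for illustration, not COL's actual style, and the function names are hypothetical.

```python
from typing import List, Tuple


def format_person(given: str, family: str) -> str:
    """Render one person as 'Family G.M.' from atomised name parts."""
    # One initial per whitespace-separated given name, upper-cased.
    initials = "".join(f"{part[0].upper()}." for part in given.split())
    return f"{family} {initials}".strip()


def format_authors(people: List[Tuple[str, str]]) -> str:
    """Join (given, family) pairs into a single, consistently formatted string."""
    names = [format_person(given, family) for given, family in people]
    if len(names) <= 1:
        return "".join(names)
    return ", ".join(names[:-1]) + " & " + names[-1]


# Made-up names; inconsistent capitalisation in the input is normalised.
citation_authors = format_authors([("anna maria", "Example"), ("Ben", "Sample")])
```

With a flat, free-text author string, the same normalisation would require parsing each provider's own format first.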

mdoering commented 3 years ago

It's reassuring to see that EML 2.2 added support for BibTeX and Markdown.

mdoering commented 3 years ago

A proposal for a new model has been added to ColDP, together with an example for a COL annual 2021 release.

mdoering commented 3 years ago

Based on #1001 we decided to use these properties with fixed "roles":

I would now even propose collapsing distributor into contributor, leaving just the classic citation roles (creator/author, editor, publisher), a single point of contact and a flexible contributor list.
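A minimal sketch of what collapsing distributor into contributor could look like on a plain metadata dict. The key names `distributor` and `contributor` come from the roles named above; the `note` field used to remember the former role and the function name are assumptions for illustration and are not part of the proposal.

```python
from typing import Any, Dict, List


def collapse_distributor(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the metadata with distributors moved into contributors."""
    result = {key: value for key, value in metadata.items() if key != "distributor"}
    contributors: List[Dict[str, Any]] = list(metadata.get("contributor", []))
    for agent in metadata.get("distributor", []):
        merged = dict(agent)
        # Hypothetical 'note' field records the former role unless one is already set.
        merged.setdefault("note", "distributor")
        contributors.append(merged)
    result["contributor"] = contributors
    return result


# Made-up example values, for illustration only.
before = {
    "creator": [{"family": "Example"}],
    "contributor": [{"family": "Sample"}],
    "distributor": [{"organisation": "Example Institute"}],
}
after = collapse_distributor(before)
```

The input dict is left unchanged, so the old and new shapes can be compared side by side during a migration.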