Metadata model update

mdoering commented 3 years ago

The current ChecklistBank (CLB) dataset metadata model is an evolution of the previous, fully flat ACEF model. It splits authors from editors and parses them into a list of simple Person instances (given and family name, email, orcid). It does the same with the ACEF organisations, which become a list of Organisation instances (department, name, city, state, country).

We need to be able to convert metadata between EML, the GBIF JSON API, DataCite (DOI), Plazi and the ColDP metadata.yaml, which we can still update to closely match our internal model. Before we do an annual release this summer, it would be good to have a finalised metadata model.

There are a couple of obstacles or possible improvements to consider:

the separation of Person from Organisation does not allow a person to be affiliated with an organisation. It also does not allow an organisation to be the contact or to have an email or website. EML, GBIF & DataCite follow an agent approach in which an agent can be a person, a person with an affiliation to an organisation, or just an organisation.
we want to list contributors other than just authors and editors. Programmers, reviewers, publishers: there are many other roles of interest to capture and give attribution to. EML allows this. See https://github.com/CatalogueOfLife/portal/issues/127
the citation string comes in various formats, and it would be good if we could generate it consistently from atomised parts.
EML has geographic, taxonomic and temporal coverage, both as free text and in a more structured form. CLB has geographicScope and a group field which gives the taxonomic scope. We should provide all 3 scopes as free text, as they are all relevant to taxonomy.
alternative identifiers are missing. We need a way to store a DOI and any other id that we know about for the dataset. The GBIF key is the only explicit other identifier supported right now.
COL is a continuous project which issues fixed releases on a monthly basis. It is desirable to summarize the changes that happened since the last version was issued. This was previously done in a "what's new" HTML page for each release. CLB should allow this to be captured, e.g. in a simple free text changes field. See #867
consider how to cite previously published data sources that may serve as future COL data sources, for example a taxonomic publication such as Sherborne that may serve as a source for the extended catalogue. There should be a mechanism for citing the original publication. Plazi does this by creating dataset metadata in MODS for a published article. See http://tb.plazi.org/GgServer/xslt/31F96F41E3E002BD88985A4F3A20E45A
we should stay as simple as we can. EML can be overwhelming and it is better to have a smaller but well defined model that can be actively managed.
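To make the agent approach from the first point concrete, here is a minimal sketch of a single type that can stand for a person, a person with an affiliation, or just an organisation. The class name `Agent` and all its field names are assumptions made for illustration; they are not the CLB, EML, GBIF or DataCite model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    """Hypothetical unified agent, replacing separate Person and Organisation types."""
    # personal part (empty for a pure organisation)
    given: Optional[str] = None
    family: Optional[str] = None
    orcid: Optional[str] = None
    # organisational part (doubles as the affiliation of a person)
    organisation: Optional[str] = None
    department: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None
    # contact details are available to both persons and organisations
    email: Optional[str] = None
    url: Optional[str] = None

    def is_person(self) -> bool:
        return self.family is not None

    def is_organisation(self) -> bool:
        # no personal name, but an organisation name
        return self.family is None and self.organisation is not None

    def label(self) -> str:
        """Display name, with the affiliation in brackets for persons."""
        if self.is_person():
            name = f"{self.family}, {self.given}" if self.given else self.family
            return f"{name} ({self.organisation})" if self.organisation else name
        return self.organisation or ""


person = Agent(given="Jane", family="Doe", organisation="Example Museum")
org = Agent(organisation="Example Museum", email="info@example.org")
print(person.label())
print(org.label())
```

With one type like this, an organisation can act as the contact and carry an email or website, which the current Person/Organisation split does not allow.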

mdoering commented 3 years ago

Should we just implement the GBIF EML profile, which is a defined subset of EML 2.1.1? This subset not only leaves out areas of EML, but also reduces cardinality for quite a few properties. Some things like keywords, links to homepage/resource or the license are also unnecessarily complex. EML is a full-fledged metadata standard that is great when metadata is the main thing you are worried about. But it is easy to get wrong, and roundtrips between implementations are likely to be lossy.

The subset selected for the GBIF profile has some collection / occurrence oriented bits in there which COL would not need. GBIF also takes a hybrid approach, storing some bits in the registry, which are then readily available in the GBIF JSON API. Other bits of the EML are only available via the XML document, which is stored and accessible via the API. The original idea about the GBIF registry was to support and store various metadata file formats, e.g. ISO 19139 or the FGDC Biological Data Profile, and only keep the core fields in a database. See IPT-Metadata-Profile-Additions_v3.doc from a year ago. So far EML has clearly dominated at GBIF.

In general I am concerned about increased complexity and thus increased resources needed to manage metadata. COL does actively manage most metadata that matters to COL. This is done for good reason: providers are often sloppy with metadata, and citations and author listings are not very consistent in their format. Here a more atomised model would help to provide consistency in formatting. I would therefore think it is best to evolve the current model and try to keep it as simple as we can. Guidance by use cases is good. I would rather extend the model in the future when we need to than start out with a very complex model that gives us trouble from the start.
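As an illustration of how atomised parts could give consistent formatting, here is a minimal sketch that builds a citation string from structured fields. The classes `Person` and `CitationParts`, the function `build_citation` and the output format itself are all assumptions for illustration; no citation style has been agreed on in this issue.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Person:
    family: str
    given: Optional[str] = None


@dataclass
class CitationParts:
    """Hypothetical atomised citation fields."""
    title: str
    authors: List[Person] = field(default_factory=list)
    editors: List[Person] = field(default_factory=list)
    year: Optional[int] = None
    version: Optional[str] = None
    publisher: Optional[str] = None
    doi: Optional[str] = None


def short_name(p: Person) -> str:
    """Family name followed by initials of the given names."""
    if not p.given:
        return p.family
    parts = p.given.replace("-", " ").split()
    initials = "".join(part[0].upper() + "." for part in parts)
    return f"{p.family} {initials}"


def build_citation(c: CitationParts) -> str:
    """Assemble one fixed (assumed) format: Names (year). Title. Version. Publisher. DOI"""
    if c.authors:
        names = ", ".join(short_name(a) for a in c.authors)
    elif c.editors:
        # fall back to editors when no authors are given
        suffix = " (eds.)" if len(c.editors) > 1 else " (ed.)"
        names = ", ".join(short_name(e) for e in c.editors) + suffix
    else:
        names = ""
    head = names
    if c.year is not None:
        head = f"{head} ({c.year})".strip()
    segments = [head, c.title]
    if c.version:
        segments.append(f"Version {c.version}")
    if c.publisher:
        segments.append(c.publisher)
    # drop trailing dots so joining never produces ".."
    text = ". ".join(s.rstrip(".") for s in segments if s) + "."
    if c.doi:
        text += f" https://doi.org/{c.doi}"
    return text


example = CitationParts(
    title="Example Checklist",
    authors=[Person("Doe", "Jane Mary"), Person("Roe", "Richard")],
    year=2021,
    version="1.2",
    publisher="Example Publisher",
    doi="10.1234/example",
)
print(build_citation(example))
```

Because every dataset goes through the same function, differences in how providers write their citations no longer show up in the output.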

mdoering commented 3 years ago

It's reassuring to see that EML 2.2 added support for BibTeX and Markdown.

mdoering commented 3 years ago

A proposal for a new model has been added to ColDP, together with an example for a COL annual 2021 release.

mdoering commented 3 years ago

Based on #1001 we decided to use these properties with fixed "roles":

contact
creator
editor
publisher
distributor
contributor (with a flexible notes field to declare a role)

I would now even propose to collapse distributor into contributor, leaving just classic citation roles (creator/author, editor, publisher), a single point of contact and a flexible contributor list.
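A minimal sketch of what that collapsed structure could look like, assuming agents are reduced to a name plus an optional note. The names `Agent`, `DatasetAgents` and `collapse_distributor` are hypothetical and only illustrate the idea; they are not the ColDP or CLB implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Agent:
    name: str
    note: Optional[str] = None  # free-text role, used for contributors


@dataclass
class DatasetAgents:
    """Hypothetical holder for the fixed roles plus a flexible contributor list."""
    contact: Optional[Agent] = None          # single point of contact
    creator: List[Agent] = field(default_factory=list)
    editor: List[Agent] = field(default_factory=list)
    publisher: Optional[Agent] = None
    contributor: List[Agent] = field(default_factory=list)


def collapse_distributor(agents: DatasetAgents, distributors: List[Agent]) -> DatasetAgents:
    """Move distributors into the contributor list, recording the role in the note."""
    for d in distributors:
        agents.contributor.append(Agent(d.name, d.note or "distributor"))
    return agents


meta = DatasetAgents(
    contact=Agent("Example Helpdesk"),
    creator=[Agent("Doe J.")],
    contributor=[Agent("Roe R.", note="programmer")],
)
collapse_distributor(meta, [Agent("Example Data Centre")])
for c in meta.contributor:
    print(c.name, "-", c.note)
```

The role that used to be a separate property survives as the contributor's note, so no attribution is lost when distributor is dropped.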

CatalogueOfLife / backend

Metadata model update #989