gbif / eml-profile

GBIF EML profile

Update GBIF's EML profile #5

Open MattBlissett opened 11 months ago

MattBlissett commented 11 months ago

Updating EML within GBIF

Ecological Metadata Language, EML, is the primary data standard used by Darwin Core Archives to provide metadata about a dataset — descriptions, information on geographic and taxonomic coverage, contacts, publishers and so on. It is also used as part of GBIF's API, and included in Darwin Core Archive data downloads.

Since 2011 we have been using EML version 2.1.1, extended with some additional properties including some from the Natural Collections Description Data (NCD) draft standard. The EML properties we recognize, as well as these extensions, are described in the GBIF Metadata Profile – How-to Guide. For reference, the 2.1.1 standard can be seen here.

It is now time for us to upgrade to the latest EML version, 2.2.0. There are several new elements which we plan to support, some of which will replace the GBIF extensions. A summary of the changes in 2.2.0 is available. It is also a good time to introduce multilingual support, so a dataset can be described in more than one language.

These updates will allow us to remove the need for some of the custom elements (or custom use of standard elements) added by GBIF.

New or updated elements

We will support these new or updated elements. Information within these elements will be added to the REST (JSON) API and shown on dataset pages as appropriate.

New or updated elements — support not planned

We don't plan to support these elements at this stage.

GBIF Extension

These are all the elements of the GBIF extension:

NCD elements.

These are from the obsolete TDWG Natural Collections Description Data (NCD) draft standard. We will leave them as they are, but in future they could be replaced by elements from a TDWG Collection Descriptions standard.

ManonGros commented 11 months ago

Thank you Matt! I think this makes sense. I think we use the NCD elements in GRSciColl synchronisation when a dataset is set as source of information for a collection. Something to keep in mind if/when we update to the TDWG Collection Descriptions standard for datasets. See also https://github.com/gbif/registry/issues/319#issuecomment-904567424

MattBlissett commented 10 months ago

Suggestion for the registry dataset API response for the DocBook-formatted fields: respond with HTML formatting, which is easy for consumers to use and is roughly what we have already, i.e. convert <para> to <p> etc., and remove the \n characters that are currently inserted between paragraphs.

I think everything there has a direct HTML equivalent.
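As a rough illustration of that conversion, here is a minimal Python sketch using only the standard library. The tag mapping, the function name docbook_to_html and the choice of h2 for section titles are assumptions made for this example, not the registry's actual behaviour.

```python
import xml.etree.ElementTree as ET

# Assumed DocBook-to-HTML tag mapping (illustrative only).
DOCBOOK_TO_HTML = {
    "para": "p",
    "itemizedlist": "ul",
    "orderedlist": "ol",
    "listitem": "li",
    "emphasis": "em",
    "subscript": "sub",
    "superscript": "sup",
    "literalLayout": "pre",
    "section": "section",
    "title": "h2",  # assumption: every section title becomes h2
    "ulink": "a",
    "citetitle": "cite",
}
BLOCK_TAGS = {"p", "ul", "ol", "li", "section", "h2", "pre"}


def docbook_to_html(xml_text):
    """Convert the children of an EML text element (e.g. <abstract>) to HTML."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag == "ulink" and "url" in elem.attrib:
            # DocBook keeps the link target in @url; HTML uses @href.
            elem.set("href", elem.attrib.pop("url"))
        elem.tag = DOCBOOK_TO_HTML.get(elem.tag, elem.tag)
        # Drop whitespace-only text after block elements, so that no
        # stray "\n" is left between paragraphs.
        if elem.tag in BLOCK_TAGS and elem.tail is not None and not elem.tail.strip():
            elem.tail = None
    # Serialize only the children; the wrapper element itself is omitted.
    # Any text placed directly inside the wrapper is ignored in this sketch.
    return "".join(ET.tostring(child, encoding="unicode") for child in root)
```

Unknown elements are passed through unchanged, so a consumer would still see them rather than silently losing content.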

mike-podolskiy90 commented 4 months ago

<distribution scope="document">
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>

This does not look like a valid example according to the schema (only one url is allowed in online, and only one online in distribution). I think the correct one would be:

<distribution scope="document">
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
  </online>
</distribution>
<distribution scope="document">
  <online>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>

I'm not sure about the distribution's scope attribute, though. It might have to be one of system or document. It seems we accept any string, so it may be worth adding an enumeration for it.

MattBlissett commented 4 months ago

The scope of the identifier. Scope is generally set to either "system", meaning that it is scoped according to the "system" attribute, or "document" if it is only to be in scope within this single document instance.

Since there isn't an identifier, I don't think we need a scope either.

<distribution>
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
  </online>
</distribution>
<distribution>
  <online>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>

mike-podolskiy90 commented 4 months ago

So should I remove the attribute in the new schema?

MattBlissett commented 4 months ago

I think that makes most sense.
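For existing documents, the agreed shape (one url per online, one online per distribution, no scope attribute) could be produced with something like the Python sketch below. The function name split_distributions is hypothetical, and the sketch assumes each distribution only contains online/url children; anything else inside a split distribution would be lost.

```python
import xml.etree.ElementTree as ET


def split_distributions(parent):
    """Rewrite <distribution> children of parent in place.

    Each resulting <distribution> holds exactly one <online>/<url>,
    and the scope attribute is dropped.
    """
    new_children = []
    for child in list(parent):
        if child.tag != "distribution":
            new_children.append(child)
            continue
        urls = child.findall("./online/url")
        if not urls:
            # Nothing to split: keep the element, only remove @scope.
            child.attrib.pop("scope", None)
            new_children.append(child)
            continue
        for url in urls:
            dist = ET.Element("distribution")
            online = ET.SubElement(dist, "online")
            new_url = ET.SubElement(online, "url", dict(url.attrib))
            new_url.text = url.text
            new_children.append(dist)
    # Replace all children at once, preserving the original order.
    parent[:] = new_children
```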

mike-podolskiy90 commented 4 months ago

@MattBlissett You haven't mentioned dataset/project/relatedProject. We might be interested in it; see the related issues https://github.com/gbif/ipt/issues/1780 and https://github.com/gbif/ipt/issues/1927

MattBlissett commented 4 months ago

I missed that, please add it.

mike-podolskiy90 commented 4 months ago

@MattBlissett I might be mistaken, but there is no dataset/physical in the dataset schema https://eml.ecoinformatics.org/eml-2.2.0/eml-dataset.xsd

mike-podolskiy90 commented 4 months ago

Also, we made emails (electronicMailAddress) a multi-valued field in the Agent entity. Should we do that for address, phone and onlineUrl too? All of those fields are already multi-valued in the registry.
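To make the question concrete, here is a small Python sketch of what reading an agent with all four fields multi-valued might look like. The Agent class, its field names and parse_agent are hypothetical and are not the registry's or the IPT's model; the XML element names are the EML ones mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List
import xml.etree.ElementTree as ET


@dataclass
class Agent:
    # Every contact detail is a list, mirroring the multi-valued registry fields.
    emails: List[str] = field(default_factory=list)
    phones: List[str] = field(default_factory=list)
    online_urls: List[str] = field(default_factory=list)
    addresses: List[Dict[str, str]] = field(default_factory=list)


def _texts(party, tag):
    """Return the stripped text of every direct child with the given tag."""
    return [(e.text or "").strip() for e in party.findall(tag)]


def parse_agent(party):
    """Build an Agent from an EML party element such as <contact> or <creator>."""
    return Agent(
        emails=_texts(party, "electronicMailAddress"),
        phones=_texts(party, "phone"),
        online_urls=_texts(party, "onlineUrl"),
        # Simplification: each address becomes a flat dict, so a repeated
        # child such as deliveryPoint would keep only its last value.
        addresses=[
            {c.tag: (c.text or "").strip() for c in a}
            for a in party.findall("address")
        ],
    )
```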

mike-podolskiy90 commented 4 months ago

I found more new/absent fields we haven't discussed yet, but might want to include:

Possible changes to the existing elements:

mdoering commented 4 months ago

From the perspective of ChecklistBank and the metadata used there, I would really appreciate it if we supported shortName and changeHistory, both of which are used there.

MattBlissett commented 4 months ago

> @MattBlissett I might be mistaken, but there is no dataset/physical in the dataset schema https://eml.ecoinformatics.org/eml-2.2.0/eml-dataset.xsd

I was probably looking at https://eml.ecoinformatics.org/eml-schema#the-eml-physical-module---physical-file-format but I'm now confused about how it fits in.

MattBlissett commented 4 months ago

I think it's fine to include other fields if other projects request them, but I recommend not adding everything — stored procedures are irrelevant, for example. Most of those are older fields which were excluded before, so I didn't change that.

The dataTable field could be used, but it would seem to duplicate other dataset descriptors (meta.xml, Frictionless). I think implementing it would be a lot of work, and not worth it when no-one has shown any interest.

mike-podolskiy90 commented 3 months ago

@MattBlissett I think I missed that, so we should also support extension of the ParagraphType (para in the GBIF schema) to support all new elements (value, itemizedlist, orderedlist, emphasis, subscript, superscript, literalLayout, ulink)? And also start supporting SectionType (section)?

mike-podolskiy90 commented 3 months ago

Btw that example is invalid:

<abstract>
  <para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
  <section>
    <title>A separate section</title>
    <para>More text</para>
    <para>And more text, with
      <itemizedlist>
        <listitem>First item</listitem>
      </itemizedlist>
      <orderedlist>
        <listitem>First item</listitem>
      </orderedlist>
      <section>
        <title>A sub-section</title>
        <emphasis>Emphasis</emphasis>
        CO<subscript>2</subscript> (or just CO₂)
        m<superscript>3</superscript> (or just m³)
        <literalLayout>
          x = fn(y, z)
        </literalLayout>
      </section>
      <ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
    </para>
  </section>
</abstract>

Issues are:

A valid example would be something like this:

<abstract>
  <para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
  <section>
    <title>A separate section</title>
    <para>More text</para>
    <para>And more text, with
      <itemizedlist>
        <listitem><para>First item</para></listitem>
      </itemizedlist>
      <orderedlist>
        <listitem><para>First item</para></listitem>
      </orderedlist>
      <ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
    </para>
    <section>
      <title>A sub-section</title>
      <para><emphasis>Emphasis</emphasis>
      CO<subscript>2</subscript> (or just CO₂)
      m<superscript>3</superscript> (or just m³)
      <literalLayout>
        x = fn(y, z)
      </literalLayout>
      </para>
    </section>
  </section>
</abstract>

MattBlissett commented 3 months ago

> @MattBlissett I think I missed that, so we should also support extension of the ParagraphType (para in the GBIF schema) to support all new elements (value, itemizedlist, orderedlist, emphasis, subscript, superscript, literalLayout, ulink)? And also start supporting SectionType (section)?

Yes please. I know this is probably annoying, but the IPT currently writes escaped HTML into the descriptive formats, which means other users of EML have to handle this — it's not ideal when EML itself includes equivalent formatting.

I suggest we support the DocBook elements where they are defined, and adjust the IPT and Registry to use them.
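A minimal Python sketch of that adjustment, turning an escaped-HTML description into DocBook elements, might look like this. It assumes the escaped HTML is well-formed (XHTML-like) once unescaped, that link labels are plain text, and that list items need a para wrapper; the mapping and the name escaped_html_to_docbook are illustrative, not what the IPT does.

```python
import html
import xml.etree.ElementTree as ET

# Assumed HTML-to-DocBook tag mapping (illustrative only).
HTML_TO_DOCBOOK = {
    "p": "para", "ul": "itemizedlist", "ol": "orderedlist", "li": "listitem",
    "em": "emphasis", "i": "emphasis", "sub": "subscript", "sup": "superscript",
    "pre": "literalLayout", "a": "ulink",
}


def escaped_html_to_docbook(escaped, wrapper="abstract"):
    """Unescape an HTML fragment and rewrite it with DocBook element names."""
    root = ET.fromstring(f"<{wrapper}>{html.unescape(escaped)}</{wrapper}>")
    for elem in list(root.iter()):  # snapshot, because the tree is modified below
        if elem.tag == "li":
            # Move the item's content into a <para> child.
            para = ET.Element("para")
            para.text, elem.text = elem.text, None
            for child in list(elem):
                elem.remove(child)
                para.append(child)
            elem.append(para)
        elif elem.tag == "a":
            # ulink carries the target in @url and the label in <citetitle>.
            if "href" in elem.attrib:
                elem.set("url", elem.attrib.pop("href"))
            title = ET.Element("citetitle")
            title.text, elem.text = elem.text, None
            elem.insert(0, title)
        elem.tag = HTML_TO_DOCBOOK.get(elem.tag, elem.tag)
    # Lists directly under the wrapper are placed inside a <para>.
    for index, child in enumerate(list(root)):
        if child.tag in ("itemizedlist", "orderedlist"):
            para = ET.Element("para")
            root[index] = para
            para.append(child)
    return ET.tostring(root, encoding="unicode")
```

Real-world descriptions often contain HTML that is not well-formed XML, so a production version would probably need a tolerant HTML parser instead of ElementTree.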

mike-podolskiy90 commented 3 months ago

EML link https://eml.ecoinformatics.org/schema/eml-text_xsd.html#TextType_para