MattBlissett commented 1 year ago

Updating EML within GBIF

Ecological Metadata Language, EML, is the primary data standard used by Darwin Core Archives to provide metadata about a dataset — descriptions, information on geographic and taxonomic coverage, contacts, publishers and so on. It is also used as part of GBIF's API, and included in Darwin Core Archive data downloads.

Since 2011 we have been using EML version 2.1.1, extended with some additional properties including some from the Natural Collections Description Data (NCD) draft standard. The EML properties we recognize, as well as these extensions, are described in the GBIF Metadata Profile – How-to Guide. For reference, the 2.1.1 standard can be seen here.

It is now time for us to upgrade to the latest EML version, 2.2.0. There are several new elements which we plan to support, some of which will replace the GBIF extensions. A summary of the changes in 2.2.0 is available. It is also a good time to introduce multilingual support, so a dataset can be described in more than one language.

These updates will allow us to remove the need for some of the custom elements (or custom use of standard elements) added by GBIF.

New or updated elements

We will support these new or updated elements. Information within these elements will be added to the REST (JSON) API and shown on dataset pages as appropriate.

New: dataset/licensed/{licenseName,url,identifier} — this will properly reference the licence used by a dataset, rather than the special ulink used at present under dataset/intellectualRights. EML also recommends a particular set of licence URIs, different to what we use ourselves (see also https://github.com/gbif/ipt/issues/1967). We will recognize the values preferred by EML (spdx.org) as well as the existing values (creativecommons.org...) TODO: What values should we use for EML we produce? Old:
```
<intellectualRights>
<para>
  This work is licensed under a
  <ulink url="http://creativecommons.org/licenses/by/4.0/legalcode">
    <citetitle>Creative Commons Attribution (CC-BY) 4.0 License</citetitle>
  </ulink>.
</para>
</intellectualRights>
```
New:
```
<licensed>
<licenseName>Creative Commons Attribution 4.0 International</licenseName>
<url>https://spdx.org/licenses/CC-BY-4.0.html</url>
<identifier>CC-BY-4.0</identifier>
</licensed>
```
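
Recognising both styles of licence URI could be sketched as below. This is a minimal Python sketch, not the registry or IPT implementation: the function name `licence_identifier` and the lookup table are hypothetical, and only the CC-BY-4.0 entry comes from the examples above (the other two rows are assumptions for illustration).

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical lookup table: only CC-BY-4.0 appears in the examples above,
# the other entries are assumptions added for illustration.
SPDX_BY_CC_PATH = {
    "/licenses/by/4.0": "CC-BY-4.0",
    "/licenses/by-nc/4.0": "CC-BY-NC-4.0",
    "/publicdomain/zero/1.0": "CC0-1.0",
}


def licence_identifier(url: str) -> Optional[str]:
    """Map an spdx.org or creativecommons.org licence URL to an SPDX identifier."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parsed.path.rstrip("/")

    if host == "spdx.org" and path.startswith("/licenses/"):
        # e.g. /licenses/CC-BY-4.0.html -> CC-BY-4.0
        name = path[len("/licenses/"):]
        for suffix in (".html", ".json"):
            if name.endswith(suffix):
                name = name[: -len(suffix)]
        return name or None

    if host == "creativecommons.org":
        # The existing EML links to .../legalcode, so drop that suffix first.
        if path.endswith("/legalcode"):
            path = path[: -len("/legalcode")]
        return SPDX_BY_CC_PATH.get(path)

    return None
```

With this, the URL in the old `intellectualRights` example and the URL in the new `licensed` example both resolve to the same identifier, and an unrecognised URL gives `None`.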

New: dataset/distribution/online — this currently lists the dataset homepage with the function "information". It may in addition link directly to a data download: Old:

<distribution scope="document">
<online>
  <url function="information">https://reeflifesurvey.com/</url>
</online>
</distribution>

New, dataset published by IPT:

<distribution scope="document">
<online>
  <url function="information">https://reeflifesurvey.com/</url>
</online>
</distribution>
<distribution scope="document">
<online>
  <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
</online>
</distribution>
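
For a consumer, reading the two distribution blocks might look like the following Python sketch. The helper name `distribution_urls` and the sample wrapper are illustrative only. Treating a missing `function` attribute as "download" is my understanding of the EML default and should be checked against the schema.

```python
import xml.etree.ElementTree as ET

SAMPLE_DATASET = """
<dataset>
  <distribution scope="document">
    <online>
      <url function="information">https://reeflifesurvey.com/</url>
    </online>
  </distribution>
  <distribution scope="document">
    <online>
      <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
    </online>
  </distribution>
</dataset>
"""


def distribution_urls(dataset_xml: str) -> dict:
    """Group dataset/distribution/online/url values by their function attribute."""
    root = ET.fromstring(dataset_xml)
    urls = {}
    for url in root.iterfind("./distribution/online/url"):
        # Assumption: a url without a function attribute is a download link.
        function = url.get("function", "download")
        urls.setdefault(function, []).append((url.text or "").strip())
    return urls


if __name__ == "__main__":
    for function, values in distribution_urls(SAMPLE_DATASET).items():
        print(function, values)
```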

New: dataset/introduction — New. One to many paragraphs that provide background and context for the dataset with appropriate figures and references. This is similar to the introduction for a journal article, and would include, for example, project objectives, hypotheses being addressed, what is known about the pattern or process under study, how the data have been used to date (including references), and how they could be used in the future.
```
<introduction>
<para>Introduction to the dataset...</para>
</introduction>
```
New: dataset/gettingStarted — New. One or more paragraphs that describe the overall interpretation, content and structure of the dataset. For example, the number and names of data files, the types of measurements that they contain, how those data files fit together in an overall design, and how they relate to the data collection methods, experimental design, and sampling design that are described in other EML sections. One might describe any specialized software that is available and/or may be necessary for analyzing or interpreting the data, and possibly include a high level description of data formats if they are unusual, keeping in mind that detailed descriptions of data structure and format are contained in the entity sections of EML. Citations, inline figures, and inline images can be included via inline references in Markdown sections.
```
<gettingStarted>
<para>A high level overview of interpretation, structure, and content of the dataset.</para>
</gettingStarted>
```

New: dataset/acknowledgements — New, "text that acknowledges funders and other key contributors." These three new elements (dataset/introduction, dataset/gettingStarted and dataset/acknowledgements) will be supported, all with multilingual support. We will support the DocBook subset used in EML, and prefer this to adding HTML. There are incompatibilities between EML's new Markdown support and multilingual support (#6), so we do not plan to support Markdown at this stage. Old:

<abstract>
<para>
  &lt;em&gt;Reef Life Survey&lt;/em&gt; (RLS) aims to improve biodiversity conservation...

New, showing all available formatting using DocBook:

<abstract>
<para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
<section>
  <title>A separate section</title>
  <para>More text</para>
  <para>And more text, with
    <itemizedlist>
      <listitem>First item</listitem>
    </itemizedlist>
    <orderedlist>
      <listitem>First item</listitem>
    </orderedlist>
    <section>
      <title>A sub-section</title>
      <emphasis>Emphasis</emphasis>
      CO<subscript>2</subscript> (or just CO₂)
      m<superscript>3</superscript> (or just m³)
      <literalLayout>
        x = fn(y, z)
      </literalLayout>
    </section>
    <ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
  </para>
</section>
</abstract>

dataset/project/award — new element for structured funding information, useful for project tracking. Note there is no multilingual support for the project title, so we will ignore the multilingual support available on some of the other parts of a project. New:

<project>
<title>...
<personnel>...
<abstract>...
<funding>...
<award>
  <funderName>Tanzania Wildlife Research Institute</funderName>
  <funderIdentifier>https://doi.org/10.13039/501100005914</funderIdentifier>
  <awardNumber>1234567890</awardNumber>
  <title>Feeding Ecology of Mahale Chimpanzee</title>
  <awardUrl>https://tawiri.or.tz/research-documents/research/</awardUrl>
</award>
<studyAreaDescription>...
<designDescription>...
</project>

dataset/project/relatedProject — new element, recursive links to other projects
dataset/creator/individualName/salutation — this is not a new element, but will allow us to preserve titles (Dr., Prof. etc) in dataset contact names while excluding them from generated citations New:
```
<individualName>
<salutation>Mr</salutation>
<givenName>Matthew</givenName>
<surName>Blissett</surName>
</individualName>
```
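
How a title could be kept for contact display but left out of a citation is sketched below in Python. The function name and the citation style (surname followed by initials) are hypothetical and only for illustration; the real citation format is not defined in this issue.

```python
import xml.etree.ElementTree as ET

SAMPLE_NAME = """
<individualName>
  <salutation>Mr</salutation>
  <givenName>Matthew</givenName>
  <surName>Blissett</surName>
</individualName>
"""


def contact_and_citation_names(individual_name_xml: str) -> tuple:
    """Return (contact display name, citation name) for one individualName."""
    name = ET.fromstring(individual_name_xml)

    def texts(tag: str) -> list:
        # Collect every non-empty occurrence of the tag, in document order.
        return [el.text.strip() for el in name.findall(tag) if el.text and el.text.strip()]

    salutations = texts("salutation")
    given = texts("givenName")
    surname = texts("surName")

    # The contact name keeps the salutation.
    contact = " ".join(salutations + given + surname)

    # Hypothetical citation style: surname plus initials, salutation dropped.
    initials = "".join(part[0] for g in given for part in g.split())
    citation = " ".join(surname + ([initials] if initials else []))
    return contact, citation


if __name__ == "__main__":
    print(contact_and_citation_names(SAMPLE_NAME))
```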

dataset/literatureCited — replaces additionalMetadata/metadata/gbif/bibliography. Note we plan only BibTeX support. Old:

<additionalMetadata>
<metadata>
  <gbif>
    <bibliography>
      <citation>Hamer, M., Victor, J., Smith, G.F. (2012). Best Practice Guide for Compiling, Maintaining and Disseminating National Species Checklists, version 1.0, released in October 2012. Copenhagen: Global Biodiversity Information Facility, 40 pp, ISBN: 87-92020-48-8, Accessible at http://www.gbif.org/orc/?doc_id=4752.</citation>

New:

<literatureCited>
<bibtex>
  @book{checklists_2012,
        title = {Best {Practice} {Guide} for {Compiling}, {Maintaining} and ...,
        author = {Hamer, Michelle and Smith, J and ...},
        year = {2012},
        ...
  }
</bibtex>
...

Dedicated issue https://github.com/gbif/gbif-metadata-profile/issues/29, won't be included in the 1.3 profile

~dataset/physical~ — there's no such thing
dataset/publisher — describes the publisher of the data. This is presented elsewhere in the GBIF API, and we will not change the way a dataset is registered using the API. However, we can expose the publisher in the GBIF-generated EML for a dataset: New:
```
<publisher id="7ce8aef0-9e92-11dc-8738-b8a03c50a862" scope="system" system="http://gbif.org">
 <organizationName>Plazi.org taxonomic treatment database</organizationName>
</publisher>
```

New or updated elements — support not planned

We don't plan to support these elements at this stage.

dataset/usageCitation — This can expose known citations of a dataset. Presented elsewhere in the API.
dataset/coverage/taxonomicCoverage/taxonomicClassification/taxonId — taxonomic identifiers added. Not yet supported by occurrences.
dataset/annotation — structured annotations (key-value). Probably deserves to be a separate task.
dataset/pubPlace — a publisher address. Not clear to me why this is different to an address on the publisher.
dataset/referencePublication — "Common cases where a Reference Publication may be useful include when a data paper is published that describes the dataset, or when a paper is intended to be the canonical or exemplar reference to the dataset."

GBIF Extension

These are all the elements of the GBIF extension:

dateStamp — kept
metadataLanguage — kept, but could use xml:lang on dataset instead
hierarchyLevel — kept
citation — kept, as EML suggests generating a citation from the components (authors etc)
bibliography — replaced with dataset/literatureCited
physical — kept
replaces — kept
resourceLogoUrl — kept

NCD elements.

These are from the obsolete TDWG Natural Collections Description Data (NCD) draft standard. We will leave them as they are, but in future they could be replaced by elements from a TDWG Collection Descriptions standard.

collection/parentCollectionIdentifier
collection/collectionName
collection/collectionIdentifier
formationPeriod
livingTimePeriod
specimenPreservationMethod
jgtiCuratorialUnit

ManonGros commented 1 year ago

Thank you Matt! I think this makes sense. I think we use the NCD elements in GRSciColl synchronisation when a dataset is set as source of information for a collection. Something to keep in mind if/when we update to the TDWG Collection Descriptions standard for datasets. See also https://github.com/gbif/registry/issues/319#issuecomment-904567424

MattBlissett commented 1 year ago

Suggestion for the registry dataset API response for the DocBook-formatted fields: respond with HTML formatting, which is easy for consumers to use and is roughly what we have already, i.e. convert <para> → <p> etc., and remove the \n characters that are currently inserted between paragraphs.

I think everything there has a direct HTML equivalent.
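
A rough Python sketch of that conversion, assuming the field arrives as a well-formed XML fragment. `TAG_MAP` and `docbook_to_html` are illustrative names, not registry code, and the choices of `h3` for `title` and `cite` for `citetitle` are guesses rather than anything agreed here.

```python
import xml.etree.ElementTree as ET

# Assumed DocBook -> HTML equivalents; title -> h3 and citetitle -> cite are guesses.
TAG_MAP = {
    "para": "p",
    "emphasis": "em",
    "subscript": "sub",
    "superscript": "sup",
    "itemizedlist": "ul",
    "orderedlist": "ol",
    "listitem": "li",
    "literalLayout": "pre",
    "section": "section",
    "title": "h3",
    "citetitle": "cite",
}


def docbook_to_html(field_xml: str) -> str:
    """Convert the children of a DocBook-formatted EML field to an HTML string."""
    root = ET.fromstring(field_xml)
    for el in root.iter():
        if el.tag == "ulink":
            # <ulink url="..."> becomes <a href="...">
            el.tag = "a"
            url = el.attrib.pop("url", None)
            if url:
                el.set("href", url)
        else:
            el.tag = TAG_MAP.get(el.tag, el.tag)

    # Serialise only the children, so the wrapper (e.g. <abstract>) is dropped,
    # and discard whitespace-only text between top-level blocks.
    parts = [(root.text or "").strip()]
    for child in root:
        if child.tail and not child.tail.strip():
            child.tail = None
        parts.append(ET.tostring(child, encoding="unicode", method="html"))
    return "".join(parts)


if __name__ == "__main__":
    print(docbook_to_html(
        "<abstract><para><emphasis>Reef Life Survey</emphasis> (RLS)</para></abstract>"
    ))
```

Paragraphs come out back to back with no newline between them, which matches the suggestion to drop the inserted line breaks.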

mike-podolskiy90 commented 9 months ago

<distribution scope="document">
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>

does not look like a valid example according to the schema (only one url is allowed in online, and only one online in distribution). I think the correct one would be:

<distribution scope="document">
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
  </online>
</distribution>
<distribution scope="document">
  <online>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>

Not sure about the distribution's scope attribute though. It might be one of system or document. It seems we accept any string, so it may be worth adding an enumeration for it.

MattBlissett commented 9 months ago

> The scope of the identifier. Scope is generally set to either "system", meaning that it is scoped according to the "system" attribute, or "document" if it is only to be in scope within this single document instance.

Since there isn't an identifier, I don't think we need a scope either.

<distribution>
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
  </online>
</distribution>
<distribution>
  <online>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>

mike-podolskiy90 commented 9 months ago

So should I remove the attribute in the new schema?

MattBlissett commented 9 months ago

I think that makes most sense.

mike-podolskiy90 commented 9 months ago

@MattBlissett You haven't mentioned dataset/project/relatedProject - we might be interested in it, related https://github.com/gbif/ipt/issues/1780 https://github.com/gbif/ipt/issues/1927

MattBlissett commented 9 months ago

I missed that, please add it.

mike-podolskiy90 commented 9 months ago

@MattBlissett I might be mistaken, but there is no dataset/physical in the dataset schema https://eml.ecoinformatics.org/eml-2.2.0/eml-dataset.xsd

mike-podolskiy90 commented 9 months ago

Also, we made emails (electronicMailAddress) a multi-valued field in the Agent entity. Should we do that for address, phone and onlineUrl too? All of those fields are already multi-valued in the registry.

mike-podolskiy90 commented 9 months ago

I found more new/absent fields we haven't discussed yet, but might want to include:

eml/access - An optional access tree at this location controls access to the entire metadata document. If this access element is omitted from the document, then the package submitter should be given full access to the package but all other users should be denied all access.
eml/dataset/shortName - The 'shortName' field provides a concise name that describes the resource that is being documented. It is the appropriate place to store a filename associated with other storage systems.
eml/dataset/series - This field describes the series of resources that include the resource being described. For example, a volume of a journal may be part of a series of the journal for a particular year.
eml/dataset/dataTable - The dataTable field documents the dataTable(s) that make up this dataset. A dataTable could be anything from a Comma Separated Value (CSV) file to a spreadsheet to a table in an RDBMS.
eml/dataset/spatialRaster - The spatialRaster field describes any spatial raster images included in this dataset.
eml/dataset/spatialVector - The spatialVector field describes any spatial vectors included in this dataset.
eml/dataset/storedProcedure - The storedProcedure field contains information about any stored procedures included with this dataset. This usually implies that the dataset is stored in a DBMS or some other data management system capable of processing your dataset.
eml/dataset/view - The view field contains information about any view included with this dataset. This usually implies that the dataset is stored in a DBMS or some other data management system capable of processing your dataset.
eml/dataset/otherEntity - The otherEntity field contains information about any entity in the dataset that is not any of the preceding entities. (i.e. it is not a table, spatialRaster, spatialVector, storedProcedure or view.) OtherEntity allows the documentation of basic entity fields as well as a plain text field to allow you to type your entity.
eml/dataset/maintenance/changeHistory - A description of changes made to the data since its release.
eml/dataset/coverage/geographicCoverage/datasetGPolygon - This construct creates a spatial ring with a hollow center. This doughnut shape is specified by the outer ring (datasetGPolygonOuterRing) and the inner exclusion zone (datasetGPolygonExclusionGRing) which can be thought of as the hole in the center of a doughnut. This is useful for defining areas such as the shores of a pond where you only want to specify the shore excluding the pond itself.
eml/dataset/coverage/geographicCoverage/boundingCoordinates/boundingAltitudes - The bounding altitude field is intended to contain altitudinal (elevation) measurements for the bounding box being described. It allows for minimum and maximum altitude fields, as well as a field for the units of measure. The combination of these fields provide the vertical extent information for the bounding box.
eml/dataset/coverage/taxonomicCoverage/taxonomicSystem - Documentation of taxonomic sources, procedures, and treatments.

Possible changes to the existing elements:

eml/dataset/methods/methodStep - make multi-valued
eml/dataset/project/title - make multi-valued
eml/dataset/coverage/temporalCoverage/singleDateTime - make multi-valued
address, phone, onlineUrl for agents (creator, contact etc.) - make multi-valued

mdoering commented 9 months ago

From the perspective of ChecklistBank and the metadata used there, I would really appreciate it if we supported shortName and changeHistory, as both are used there.

MattBlissett commented 9 months ago

> @MattBlissett I might be mistaken, but there is no dataset/physical in the dataset schema https://eml.ecoinformatics.org/eml-2.2.0/eml-dataset.xsd

I was probably looking at https://eml.ecoinformatics.org/eml-schema#the-eml-physical-module---physical-file-format but I'm now confused about how it fits in.

MattBlissett commented 9 months ago

I think it's fine to include other fields if other projects request them, but I recommend not adding everything — stored procedures are irrelevant, for example. Most of those are older fields which were excluded before, so I didn't change that.

The dataTable field could be used, but it would seem to duplicate other dataset descriptors (meta.xml, Frictionless). I think implementing it would be a lot of work, and not worth it when no-one has shown any interest.

mike-podolskiy90 commented 8 months ago

@MattBlissett I think I missed that, so we should also support extension of the ParagraphType (para in the GBIF schema) to support all new elements (value, itemizedlist, orderedlist, emphasis, subscript, superscript, literalLayout, ulink)? And also start supporting SectionType (section)?

mike-podolskiy90 commented 8 months ago

Btw that example is invalid:

<abstract>
  <para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
  <section>
    <title>A separate section</title>
    <para>More text</para>
    <para>And more text, with
      <itemizedlist>
        <listitem>First item</listitem>
      </itemizedlist>
      <orderedlist>
        <listitem>First item</listitem>
      </orderedlist>
      <section>
        <title>A sub-section</title>
        <emphasis>Emphasis</emphasis>
        CO<subscript>2</subscript> (or just CO₂)
        m<superscript>3</superscript> (or just m³)
        <literalLayout>
          x = fn(y, z)
        </literalLayout>
      </section>
      <ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
    </para>
  </section>
</abstract>

Issues are:

listitem must have a para inside
para can't directly contain section
section can't directly contain emphasis

A valid example would be something like this:

<abstract>
  <para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
  <section>
    <title>A separate section</title>
    <para>More text</para>
    <para>And more text, with
      <itemizedlist>
        <listitem><para>First item</para></listitem>
      </itemizedlist>
      <orderedlist>
        <listitem><para>First item</para></listitem>
      </orderedlist>
      <ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
    </para>
    <section>
      <title>A sub-section</title>
      <para><emphasis>Emphasis</emphasis>
      CO<subscript>2</subscript> (or just CO₂)
      m<superscript>3</superscript> (or just m³)
      <literalLayout>
        x = fn(y, z)
      </literalLayout>
      </para>
    </section>
  </section>
</abstract>
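
The three issues listed above can be checked mechanically. The Python sketch below covers only those three rules and is not a substitute for validating against the EML schema. The function name `structure_problems` and the shortened sample fragments are illustrative.

```python
import xml.etree.ElementTree as ET

INVALID = (
    "<abstract><para>And more text, with"
    "<itemizedlist><listitem>First item</listitem></itemizedlist>"
    "<section><title>A sub-section</title><emphasis>Emphasis</emphasis></section>"
    "</para></abstract>"
)

VALID = (
    "<abstract><para>And more text, with"
    "<itemizedlist><listitem><para>First item</para></listitem></itemizedlist>"
    "</para>"
    "<section><title>A sub-section</title><para><emphasis>Emphasis</emphasis></para></section>"
    "</abstract>"
)


def structure_problems(field_xml: str) -> list:
    """Report violations of the three nesting rules discussed in this thread."""
    problems = []
    for parent in ET.fromstring(field_xml).iter():
        child_tags = [child.tag for child in parent]
        if parent.tag == "listitem" and "para" not in child_tags:
            problems.append("listitem must contain a para")
        if parent.tag == "para" and "section" in child_tags:
            problems.append("para cannot directly contain section")
        if parent.tag == "section" and "emphasis" in child_tags:
            problems.append("section cannot directly contain emphasis")
    return problems


if __name__ == "__main__":
    print(structure_problems(INVALID))
    print(structure_problems(VALID))
```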

MattBlissett commented 8 months ago

> @MattBlissett I think I missed that, so we should also support extension of the ParagraphType (para in the GBIF schema) to support all new elements (value, itemizedlist, orderedlist, emphasis, subscript, superscript, literalLayout, ulink)? And also start supporting SectionType (section)?

Yes please. I know this is probably annoying, but the IPT currently writes escaped HTML into the descriptive formats, which means other users of EML have to handle this — it's not ideal when EML itself includes equivalent formatting.

I suggest we support the DocBook elements where they are defined, and adjust the IPT and Registry to use them.

mike-podolskiy90 commented 8 months ago

EML link https://eml.ecoinformatics.org/schema/eml-text_xsd.html#TextType_para

gbif / eml-profile

Update GBIF's EML profile (EML 2.2.0) #5
