Closed MattBlissett closed 1 month ago
Thank you Matt! I think this makes sense. I think we use the NCD elements in GRSciColl synchronisation when a dataset is set as source of information for a collection. Something to keep in mind if/when we update to the TDWG Collection Descriptions standard for datasets. See also https://github.com/gbif/registry/issues/319#issuecomment-904567424
Suggestion for the registry dataset API response for the DocBook-formatted fields: respond with HTML formatting, which is easy for consumers to use and sort-of what we have already, i.e. convert <para> → <p> etc., and remove the current \n that are inserted between paragraphs.
I think everything there has a direct HTML equivalent.
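A minimal sketch of the conversion suggested above, assuming Python and only the standard library. The tag mapping and the function name `docbook_to_html` are illustrative assumptions, not the registry's actual implementation; in particular, mapping `title` to `h3` is an arbitrary choice.

```python
import html
import xml.etree.ElementTree as ET

# Assumed mapping from the DocBook subset used in EML to HTML tags.
TAG_MAP = {
    "para": "p",
    "section": "section",
    "title": "h3",  # assumption: any heading level could be chosen
    "itemizedlist": "ul",
    "orderedlist": "ol",
    "listitem": "li",
    "emphasis": "em",
    "subscript": "sub",
    "superscript": "sup",
    "literalLayout": "pre",
}


def _render(el):
    """Render one element and its children as an HTML string."""
    inner = html.escape(el.text or "", quote=False)
    for child in el:
        inner += _render(child)
        inner += html.escape(child.tail or "", quote=False)
    if el.tag == "ulink":
        href = html.escape(el.get("url", ""), quote=True)
        return f'<a href="{href}">{inner}</a>'
    tag = TAG_MAP.get(el.tag)
    if tag is None:
        # Unmapped wrappers such as citetitle: keep the content only.
        return inner
    return f"<{tag}>{inner}</{tag}>"


def docbook_to_html(fragment):
    """Convert a DocBook fragment to HTML, dropping the whitespace
    (including "\n") that sits between top-level elements."""
    root = ET.fromstring(f"<root>{fragment}</root>")
    return "".join(_render(child) for child in root)


print(docbook_to_html("<para>One</para>\n<para>Two <emphasis>bold</emphasis></para>"))
```

Text is re-escaped on output, so characters such as `&` in the source stay valid in the HTML.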
<distribution scope="document">
<online>
<url function="information">https://reeflifesurvey.com/</url>
<url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
</online>
</distribution>
The example above does not look like a valid example according to the schema (only one url is allowed in online, and only one online in distribution). I think the correct one would be:
<distribution scope="document">
<online>
<url function="information">https://reeflifesurvey.com/</url>
</online>
</distribution>
<distribution scope="document">
<online>
<url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
</online>
</distribution>
Not sure about the distribution's scope attribute though. It might be one of system or document. It seems we accept any string, so it may be worth adding an enumeration for it.
The scope of the identifier. Scope is generally set to either "system", meaning that it is scoped according to the "system" attribute, or "document" if it is only to be in scope within this single document instance.
Since there isn't an identifier, I don't think we need a scope either.
<distribution>
<online>
<url function="information">https://reeflifesurvey.com/</url>
</online>
</distribution>
<distribution>
<online>
<url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
</online>
</distribution>
So should I remove the attribute in the new schema?
I think that makes most sense.
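The outcome of this exchange (one url per online, one online per distribution, no scope attribute) could be sketched as follows. This is an illustration in Python using only the standard library; the function name `split_distributions` is hypothetical and nothing here reflects how the IPT or Registry actually write EML.

```python
import xml.etree.ElementTree as ET


def split_distributions(xml_text):
    """Return one <distribution> string per <url> found in the input,
    without the scope attribute."""
    source = ET.fromstring(xml_text)
    result = []
    for url in source.iter("url"):
        distribution = ET.Element("distribution")  # no scope attribute
        online = ET.SubElement(distribution, "online")
        new_url = ET.SubElement(online, "url", dict(url.attrib))
        new_url.text = url.text
        result.append(ET.tostring(distribution, encoding="unicode"))
    return result


example = """<distribution scope="document">
  <online>
    <url function="information">https://reeflifesurvey.com/</url>
    <url function="download">https://cloud.gbif.org/griis/archive.do?r=global</url>
  </online>
</distribution>"""

for item in split_distributions(example):
    print(item)
```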
@MattBlissett You haven't mentioned dataset/project/relatedProject - we might be interested in it, related https://github.com/gbif/ipt/issues/1780 https://github.com/gbif/ipt/issues/1927
I missed that, please add it.
@MattBlissett I might be mistaken, but there is no dataset/physical in the dataset schema https://eml.ecoinformatics.org/eml-2.2.0/eml-dataset.xsd
Also, we made emails (electronicMailAddress) a multi-valued field in the Agent entity. Should we do that for address, phone and onlineUrl too? All of those fields are already multi-valued in the registry.
I found more new/absent fields we haven't discussed yet, but might want to include:
eml/access - An optional access tree at this location controls access to the entire metadata document. If this access element is omitted from the document, then the package submitter should be given full access to the package but all other users should be denied all access.
eml/dataset/shortName - The 'shortName' field provides a concise name that describes the resource that is being documented. It is the appropriate place to store a filename associated with other storage systems.
eml/dataset/series - This field describes the series of resources that include the resource being described. For example, a volume of a journal may be part of a series of the journal for a particular year.
eml/dataset/dataTable - The dataTable field documents the dataTable(s) that make up this dataset. A dataTable could be anything from a Comma Separated Value (CSV) file to a spreadsheet to a table in an RDBMS.
eml/dataset/spatialRaster - The spatialRaster field describes any spatial raster images included in this dataset.
eml/dataset/spatialVector - The spatialVector field describes any spatial vectors included in this dataset.
eml/dataset/storedProcedure - The storedProcedure field contains information about any stored procedures included with this dataset. This usually implies that the dataset is stored in a DBMS or some other data management system capable of processing your dataset.
eml/dataset/view - The view field contains information about any view included with this dataset. This usually implies that the dataset is stored in a DBMS or some other data management system capable of processing your dataset.
eml/dataset/otherEntity - The otherEntity field contains information about any entity in the dataset that is not any of the preceding entities. (i.e. it is not a table, spatialRaster, spatialVector, storedProcedure or view.) OtherEntity allows the documentation of basic entity fields as well as a plain text field to allow you to type your entity.
eml/dataset/maintenance/changeHistory - A description of changes made to the data since its release.
eml/dataset/coverage/geographicCoverage/datasetGPolygon - This construct creates a spatial ring with a hollow center. This doughnut shape is specified by the outer ring (datasetGPolygonOuterRing) and the inner exclusion zone (datasetGPolygonExclusionGRing) which can be thought of as the hole in the center of a doughnut. This is useful for defining areas such as the shores of a pond where you only want to specify the shore excluding the pond itself.
eml/dataset/coverage/geographicCoverage/boundingCoordinates/boundingAltitudes - The bounding altitude field is intended to contain altitudinal (elevation) measurements for the bounding box being described. It allows for minimum and maximum altitude fields, as well as a field for the units of measure. The combination of these fields provide the vertical extent information for the bounding box.
eml/dataset/coverage/taxonomicCoverage/taxonomicSystem - Documentation of taxonomic sources, procedures, and treatments.
Possible changes to the existing elements:
eml/dataset/methods/methodStep - make multi-valued
eml/dataset/project/title - make multi-valued
eml/dataset/coverage/temporalCoverage/singleDateTime - make multi-valued
address, phone, onlineUrl for agents (creator, contact etc.) - make multi-valued
From the perspective of ChecklistBank and metadata used there I would really appreciate if we'd support shortName and changeHistory, both being used there.
@MattBlissett I might be mistaken, but there is no dataset/physical in the dataset schema https://eml.ecoinformatics.org/eml-2.2.0/eml-dataset.xsd
I was probably looking at https://eml.ecoinformatics.org/eml-schema#the-eml-physical-module---physical-file-format but I'm now confused on how it fits in.
I think it's fine to include other fields if other projects request them, but I recommend not adding everything — stored procedures are irrelevant, for example. Most of those are older fields which were excluded before, so I didn't change that.
The dataTable field could be used, but it would seem to duplicate other dataset descriptors (meta.xml, Frictionless). I think implementing it would be a lot of work, and not worth it when no-one has shown any interest.
@MattBlissett I think I missed that, so we should also support extension of the ParagraphType (para in the GBIF schema) to support all new elements (value, itemizedlist, orderedlist, emphasis, subscript, superscript, literalLayout, ulink)? And also start supporting SectionType (section)?
Btw that example is invalid:
<abstract>
<para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
<section>
<title>A separate section</title>
<para>More text</para>
<para>And more text, with
<itemizedlist>
<listitem>First item</listitem>
</itemizedlist>
<orderedlist>
<listitem>First item</listitem>
</orderedlist>
<section>
<title>A sub-section</title>
<emphasis>Emphasis</emphasis>
CO<subscript>2</subscript> (or just CO₂)
m<superscript>3</superscript> (or just m³)
<literalLayout>
x = fn(y, z)
</literalLayout>
</section>
<ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
</para>
</section>
</abstract>
Issues are:
listitem must have a para inside
para can't directly contain section
section can't directly contain emphasis
A valid example would be something like this:
<abstract>
<para><emphasis>Reef Life Survey</emphasis> (RLS) aims to improve biodiversity conservation...</para>
<section>
<title>A separate section</title>
<para>More text</para>
<para>And more text, with
<itemizedlist>
<listitem><para>First item</para></listitem>
</itemizedlist>
<orderedlist>
<listitem><para>First item</para></listitem>
</orderedlist>
<ulink url="https://example.org"><citetitle>Example link</citetitle></ulink>
</para>
<section>
<title>A sub-section</title>
<para><emphasis>Emphasis</emphasis>
CO<subscript>2</subscript> (or just CO₂)
m<superscript>3</superscript> (or just m³)
<literalLayout>
x = fn(y, z)
</literalLayout>
</para>
</section>
</section>
</abstract>
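The three issues listed above can be checked mechanically. The following is a rough sketch in Python using only the standard library; it covers only those three rules and is not a substitute for validating against the EML schema. The function name `find_structure_issues` is hypothetical.

```python
import xml.etree.ElementTree as ET


def find_structure_issues(xml_text):
    """Report violations of the three structural rules named in this thread."""
    root = ET.fromstring(xml_text)
    issues = []
    for parent in root.iter():
        child_tags = [child.tag for child in parent]
        if parent.tag == "listitem" and "para" not in child_tags:
            issues.append("listitem must have a para inside")
        if parent.tag == "para" and "section" in child_tags:
            issues.append("para can't directly contain section")
        if parent.tag == "section" and "emphasis" in child_tags:
            issues.append("section can't directly contain emphasis")
    return issues


# Shortened versions of the two examples discussed above.
invalid = (
    "<abstract><section><title>T</title><para>text"
    "<itemizedlist><listitem>First item</listitem></itemizedlist>"
    "<section><title>S</title><emphasis>E</emphasis></section>"
    "</para></section></abstract>"
)
valid = (
    "<abstract><section><title>T</title><para>text"
    "<itemizedlist><listitem><para>First item</para></listitem></itemizedlist>"
    "</para><section><title>S</title><para><emphasis>E</emphasis></para>"
    "</section></section></abstract>"
)

print(find_structure_issues(invalid))
print(find_structure_issues(valid))
```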
@MattBlissett I think I missed that, so we should also support extension of the ParagraphType (para in the GBIF schema) to support all new elements (value, itemizedlist, orderedlist, emphasis, subscript, superscript, literalLayout, ulink)? And also start supporting SectionType (section)?
Yes please. I know this is probably annoying, but the IPT currently writes escaped HTML into the descriptive formats, which means other users of EML have to handle this — it's not ideal when EML itself includes equivalent formatting.
I suggest we support the DocBook elements where they are defined, and adjust the IPT and Registry to use them.
Updating EML within GBIF
Ecological Metadata Language, EML, is the primary data standard used by Darwin Core Archives to provide metadata about a dataset — descriptions, information on geographic and taxonomic coverage, contacts, publishers and so on. It is also used as part of GBIF's API, and included in Darwin Core Archive data downloads.
Since 2011 we have been using EML version 2.1.1, extended with some additional properties including some from the Natural Collections Description Data (NCD) draft standard. The EML properties we recognize, as well as these extensions, are described in the GBIF Metadata Profile – How-to Guide. For reference, the 2.1.1 standard can be seen here.
It is now time for us to upgrade to the latest EML version, 2.2.0. There are several new elements which we plan to support, some of which will replace the GBIF extensions. A summary of the changes in 2.2.0 is available. It is also a good time to introduce multilingual support, so a dataset can be described in more than one language.
These updates will allow us to remove the need for some of the custom elements (or custom use of standard elements) added by GBIF.
New or updated elements
We will support these new or updated elements. Information within these elements will be added to the REST (JSON) API and shown on dataset pages as appropriate.
New:
dataset/licensed/{licenseName,url,identifier} - this will properly reference the licence used by a dataset, rather than the special ulink used at present under dataset/intellectualRights. EML also recommends a particular set of licence URIs, different to what we use ourselves (see also https://github.com/gbif/ipt/issues/1967). We will recognize the values preferred by EML (spdx.org) as well as the existing values (creativecommons.org...).
TODO: What values should we use for EML we produce?
Old:
New:
New:
dataset/distribution/online - this currently lists the dataset homepage with the function "information". It may in addition link directly to a data download:
Old:
New, dataset published by IPT:
New:
dataset/introduction - New. One to many paragraphs that provide background and context for the dataset with appropriate figures and references. This is similar to the introduction for a journal article, and would include, for example, project objectives, hypotheses being addressed, what is known about the pattern or process under study, how the data have been used to date (including references), and how they could be used in the future.
New:
dataset/gettingStarted - New. One or more paragraphs that describe the overall interpretation, content and structure of the dataset. For example, the number and names of data files, the types of measurements that they contain, how those data files fit together in an overall design, and how they relate to the data collections methods, experimental design, and sampling design that are described in other EML sections. One might describe any specialized software that is available and/or may be necessary for analyzing or interpreting the data, and possibly include a high level description of data formats if they are unusual, keeping in mind that detailed descriptions of data structure and format are contained in the entity sections of EML. Citations, inline figures, and inline images can be included via inline references in Markdown sections.
New:
dataset/acknowledgements - New, "text that acknowledges funders and other key contributors."
Three new elements will be supported, all with multilingual support. We will support the DocBook subset used in EML, and prefer this to adding HTML. There are incompatibilities between EML's new Markdown support and multilingual support (#6), so we do not plan to support Markdown at this stage.
Old:
New, showing all available formatting using DocBook:
dataset/project/award - new element for structured funding information, all useful for project tracking. Note there is no multilingual support for the project title, so we will ignore the support on some of the other parts of a project.
New:
dataset/project/relatedProject - new element, recursive links to other projects
dataset/creator/individualName/salutation - this is not a new element, but will allow us to preserve titles (Dr., Prof. etc) in dataset contact names while excluding them from generated citations
New:
dataset/literatureCited - replaces additionalMetadata/metadata/gbif/bibliography. Note we plan only BibTeX support.
Old:
New:
Dedicated issue https://github.com/gbif/gbif-metadata-profile/issues/29, won't be included in the 1.3 profile
~dataset/physical~ - there's no such thing
dataset/publisher - describes the publisher of the data. This is presented elsewhere in the GBIF API, and we will not change the way a dataset is registered using the API. However, we can expose the publisher in the GBIF-generated EML for a dataset:
New:
New or updated elements - support not planned
We don't plan to support these elements at this stage.
dataset/usageCitation - This can expose known citations of a dataset. Presented elsewhere in the API.
dataset/coverage/taxonomicCoverage/taxonomicClassification/taxonId - taxonomic identifiers added. Not yet supported by occurrences.
dataset/annotation - structured annotations (key-value). Probably deserves to be a separate task.
dataset/pubPlace - a publisher address. Not clear to me why this is different to an address on the publisher.
dataset/referencePublication - "Common cases where a Reference Publication may be useful include when a data paper is published that describes the dataset, or when a paper is intended to be the canonical or examplar reference to the dataset."
GBIF Extension
These are all the elements of the GBIF extension:
dateStamp - kept
metadataLanguage - kept, but could use xml:lang on dataset instead
hierarchyLevel - kept
citation - kept, as EML suggests generating a citation from the components (authors etc)
bibliography - replaced with dataset/literatureCited
physical - kept
replaces - kept
resourceLogoUrl - kept
NCD elements.
These are from the obsolete TDWG Natural Collections Description Data (NCD) draft standard. We will leave them as they are, but in future they could be replaced by elements from a TDWG Collection Descriptions standard.
collection/parentCollectionIdentifier
collection/collectionName
collection/collectionIdentifier
formationPeriod
livingTimePeriod
specimenPreservationMethod
jgtiCuratorialUnit
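Returning to dataset/licensed: recognizing both the spdx.org values preferred by EML and the existing creativecommons.org values could be sketched as below. This is an assumption-laden illustration in Python, not GBIF's implementation; the licence table, the normalization rules and the function name `spdx_identifier` are all chosen for the example.

```python
from typing import Optional

# Illustrative table (an assumption): SPDX identifier -> creativecommons.org path.
KNOWN_LICENCES = {
    "CC0-1.0": "creativecommons.org/publicdomain/zero/1.0",
    "CC-BY-4.0": "creativecommons.org/licenses/by/4.0",
    "CC-BY-NC-4.0": "creativecommons.org/licenses/by-nc/4.0",
}


def _normalize(uri: str) -> str:
    """Lower-case the URI and drop scheme, trailing slash, /legalcode and .html."""
    uri = uri.strip().lower()
    for prefix in ("https://", "http://"):
        if uri.startswith(prefix):
            uri = uri[len(prefix):]
    uri = uri.rstrip("/")
    for suffix in ("/legalcode", ".html"):
        if uri.endswith(suffix):
            uri = uri[: -len(suffix)]
    return uri.rstrip("/")


def spdx_identifier(uri: str) -> Optional[str]:
    """Map either style of licence URI to an SPDX identifier, or None if unknown."""
    key = _normalize(uri)
    for spdx_id, cc_path in KNOWN_LICENCES.items():
        if key in (cc_path, f"spdx.org/licenses/{spdx_id.lower()}"):
            return spdx_id
    return None


print(spdx_identifier("https://spdx.org/licenses/CC-BY-4.0.html"))
print(spdx_identifier("http://creativecommons.org/licenses/by-nc/4.0/legalcode"))
```

Returning None for anything outside the table leaves the open TODO (which values to write in produced EML) untouched; the sketch only covers recognition on input.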