FDSN / StationXML

The FDSN StationXML schema and related documents
https://docs.fdsn.org/projects/stationxml/

Data license declaration capability #95

Open chad-earthscope opened 3 years ago

chad-earthscope commented 3 years ago

Currently there is no clear place to include a data license declaration in StationXML and doing so is becoming increasingly important.

One option is to add this by allowing a DataLicense element in the BaseNode definition, which would allow declaration at the Network, Station, and Channel levels.

The element would be optional and could occur any number of times. An abbreviation attribute allows declaration of the common label often used, e.g. CC0, CC-BY, etc. A URL attribute allows identification of license text.

This is analogous to the Identifier element added in the 1.1 revision.

For example:

<FDSNStationXML xmlns="http://www.fdsn.org/xml/station/1" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xmlns:iris="http://www.fdsn.org/xml/station/1/iris" 
  xsi:schemaLocation="http://www.fdsn.org/xml/station/1 http://www.fdsn.org/xml/station/fdsn-station-1.1.xsd" 
  schemaVersion="1.1">
...
<Network code="IU" startDate="1988-01-01T00:00:00" restrictedStatus="open">
  <Description>Global Seismograph Network - IRIS/USGS (GSN)</Description>
  <Identifier type="DOI">10.7914/SN/IU</Identifier>
  <DataLicense URL="https://creativecommons.org/share-your-work/public-domain/cc0/"
      abbreviation="CC0">Creative Commons - No Rights Reserved</DataLicense>
  <TotalNumberStations>127</TotalNumberStations>
  <SelectedNumberStations>3</SelectedNumberStations>

In the schema:

<xs:complexType name="BaseNodeType">
  ...
  <xs:sequence>
  <xs:element name="Description" type="xs:string" minOccurs="0"/>
  <xs:element name="Identifier" type="fsx:IdentifierType" minOccurs="0" maxOccurs="unbounded"/>
  <xs:element name="DataLicense" type="fsx:DataLicenseType" minOccurs="0" maxOccurs="unbounded"/>
  ...

and

<xs:complexType name="DataLicenseType">
  <xs:annotation>
    <xs:documentation>A type to document data licenses.
    </xs:documentation>
  </xs:annotation>
  <xs:simpleContent>
    <xs:extension base="xs:string">
      <xs:attribute name="abbreviation" type="xs:string"> </xs:attribute>
      <xs:attribute name="URL" type="xs:string"> </xs:attribute>
    </xs:extension>
  </xs:simpleContent>
</xs:complexType>
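
To illustrate what machine parsing of the proposed element could look like, here is a minimal Python sketch using only the standard library. It assumes the DataLicense element and its abbreviation and URL attributes exactly as proposed above (they are not part of StationXML 1.1), and the sample document is trimmed for illustration, so it is not a complete valid StationXML file. The function name network_licenses is hypothetical.

```python
import xml.etree.ElementTree as ET

# StationXML namespace, as declared in the example above.
NS = {"fsx": "http://www.fdsn.org/xml/station/1"}

# Trimmed, hypothetical document: DataLicense is the proposed element,
# not something StationXML 1.1 defines.
SAMPLE = """<FDSNStationXML xmlns="http://www.fdsn.org/xml/station/1" schemaVersion="1.1">
  <Network code="IU" startDate="1988-01-01T00:00:00" restrictedStatus="open">
    <Description>Global Seismograph Network - IRIS/USGS (GSN)</Description>
    <Identifier type="DOI">10.7914/SN/IU</Identifier>
    <DataLicense URL="https://creativecommons.org/share-your-work/public-domain/cc0/"
        abbreviation="CC0">Creative Commons - No Rights Reserved</DataLicense>
  </Network>
</FDSNStationXML>"""


def network_licenses(xml_text):
    """Return (network code, abbreviation, URL, label) for each DataLicense."""
    root = ET.fromstring(xml_text)
    found = []
    for net in root.findall("fsx:Network", NS):
        for lic in net.findall("fsx:DataLicense", NS):
            label = (lic.text or "").strip()
            found.append((net.get("code"), lic.get("abbreviation"), lic.get("URL"), label))
    return found


licenses = network_licenses(SAMPLE)
for entry in licenses:
    print(entry)
```

Because both attributes are optional in the proposed schema, a consumer would need to handle None for either of them.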

crotwell commented 3 years ago

This seems like the kind of issue others have encountered, and stationxml should follow any existing solutions. See for example: https://spdx.dev/ids/

Is there a compelling use case for a new element vs embedding the license in an XML comment, like:

<Network code="IU" startDate="1988-01-01T00:00:00" restrictedStatus="open">
  <!-- 
Licensed under Creative Commons - No Rights Reserved, CC0
https://creativecommons.org/share-your-work/public-domain/cc0/
-->

or even just:

<Network code="IU" startDate="1988-01-01T00:00:00" restrictedStatus="open">
  <!-- SPDX-License-Identifier: CC0-1.0  -->

In other words, if there is a need for machine parsing of the license, then the syntax should probably be locked down even more than the given example. For example requiring the abbreviation to be from https://spdx.org/licenses/ as opposed to each user making up their own abbreviations. If machine parsing is not needed then the comment idea would work today with no schema changes, which is of course an advantage.
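
As a rough sketch of what "locking down" the abbreviation could mean in practice, the Python below checks a value against a small hard-coded subset of SPDX identifiers. The subset and the function name is_known_spdx_id are illustrative assumptions; a real check would load the full list published at https://spdx.org/licenses/.

```python
# Illustrative subset only; the authoritative list lives at https://spdx.org/licenses/
KNOWN_SPDX_IDS = frozenset({"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "CC-BY-NC-4.0"})


def is_known_spdx_id(abbreviation):
    """Exact-match check of an abbreviation against the subset above."""
    return abbreviation.strip() in KNOWN_SPDX_IDS


for candidate in ("CC0-1.0", "CC0", "CC-BY"):
    print(candidate, is_known_spdx_id(candidate))
```

Note that the informal labels "CC0" and "CC-BY" are rejected, since the SPDX identifiers carry a version (CC0-1.0, CC-BY-4.0). That is the kind of mismatch a free-text abbreviation attribute would allow.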

Is there a compelling use case for a channel having a different license from its enclosing station and/or network? I can see a different network perhaps requiring a different license, but we should avoid "xml bloat" where every channel for every station in a network repeats the same license. Even just repeating the abbreviation adds a lot of noise to the xml.
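
One way to avoid the repetition would be for readers to treat a license as inherited from the nearest enclosing level that declares one. The Python sketch below assumes that rule, which nothing in this thread has settled, along with the proposed DataLicense element. The station code XXXX and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"fsx": "http://www.fdsn.org/xml/station/1"}

# Hypothetical document: the license is declared once per network and
# overridden at one station. DataLicense is the proposed element.
SAMPLE = """<FDSNStationXML xmlns="http://www.fdsn.org/xml/station/1" schemaVersion="1.1">
  <Network code="IU">
    <DataLicense abbreviation="CC0"/>
    <Station code="ANMO">
      <Channel code="BHZ" locationCode="00"/>
    </Station>
    <Station code="XXXX">
      <DataLicense abbreviation="CC-BY"/>
      <Channel code="BHZ" locationCode="00"/>
    </Station>
  </Network>
</FDSNStationXML>"""


def _abbreviations(node):
    """Abbreviations of the DataLicense elements declared directly on node."""
    return [lic.get("abbreviation") for lic in node.findall("fsx:DataLicense", NS)]


def effective_licenses(xml_text):
    """Map (net, sta, loc, cha) to the nearest declared license list (assumed rule)."""
    root = ET.fromstring(xml_text)
    result = {}
    for net in root.findall("fsx:Network", NS):
        net_lic = _abbreviations(net)
        for sta in net.findall("fsx:Station", NS):
            # Fall back to the network's licenses when the station declares none.
            sta_lic = _abbreviations(sta) or net_lic
            for cha in sta.findall("fsx:Channel", NS):
                key = (net.get("code"), sta.get("code"), cha.get("locationCode"), cha.get("code"))
                result[key] = _abbreviations(cha) or sta_lic
    return result


resolved = effective_licenses(SAMPLE)
for key, value in resolved.items():
    print(key, value)
```

Under this assumed rule a network-wide license is written once, and only the exceptions need their own element.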

If incorporating a license is needed, it might be good to consider whether also incorporating a copyright is wise. Copyright and license are orthogonal concepts, but sometimes knowing one without the other is a problem.

Lastly, of course, this sort of thing goes beyond seismologists and software developers and starts to get lawyers involved, so tread carefully.

chad-earthscope commented 3 years ago

> In other words, if there is a need for machine parsing of the license, then the syntax should probably be locked down even more than the given example. For example requiring the abbreviation to be from https://spdx.org/licenses/ as opposed to each user making up their own abbreviations. If machine parsing is not needed then the comment idea would work today with no schema changes, which is of course an advantage.

I believe machine parsable is the primary goal of this format and design should be targeting that use. Also, details should be describable in the schema, and I don't think that's possible for comments. For these reasons the XML comment option is less desirable in my opinion.

I completely agree that if we can find a list of abbreviations and/or other definitions and examples to draw from we should.

> Lastly, of course, this sort of thing goes beyond seismologists and software developers and starts to get lawyers involved, so tread carefully.

I do not believe licensing data is controversial. In my non-legal opinion, at this point we risk doing more harm than good by not having the ability to declare a license in standardized metadata.

An issue that should also be considered is whether the declared license covers the metadata in addition to the data it describes. Traditionally we have treated metadata as "public domain" in the sense that it can be freely used with no restrictions (or requirements of citation). We should be clear on the scope of any declaration.

crotwell commented 3 years ago

> An issue that should also be considered is whether the declared license covers the metadata in addition to the data it describes.

I missed this, so it is worth documenting carefully. I thought you were talking about the stationxml, not the miniseed it relates to. "Data" is a pretty generic term, perhaps there is a way to make it clearer. The recommendation makes more sense now, and I agree a comment likely is not the right answer.

Regardless, if we are licensing the waveforms, we might need to license the stationxml metadata too. Just because it is "meta" doesn't mean it isn't someone's property.

> I do not believe licensing data is controversial. In my non-legal opinion, at this point we risk doing more harm than good by not having the ability to declare a license in standardized metadata.

What I mean by this is the details may matter a lot. For example, if there are 2 data license elements, does that mean both have to be satisfied, or can the user pick between the two (AND vs. OR)? Maybe better to have only one to avoid this ambiguity. Other seemingly small details can have outsized effects.

This may be useful: https://wiki.creativecommons.org/wiki/Marking_Works_Technical

jschaeff commented 2 years ago

The licence on the stationXML document should be explicit. An attribute in the markup should be enough, as we don't need to describe complex licencing (99% of cases will be CC-0?).

The licences markup on the waveform data should have starttime/endtime attributes. We should look at DataCite's way of describing the licences to make sure we can describe complex licencing correctly.

There will be duplication between the DOI's metadata and the stationXML metadata on this matter. So maybe the documentation should state which one is authoritative (DOI or stationXML?). From a datacenter point of view, the licence fields in stationXML could be filled from the same sources as the DataCite fields to ensure consistency.

crotwell commented 2 years ago

@jschaeff do you have a link for how DataCite does this? That would be helpful.

If there was a starttime/endtime on a license, that could get complicated quickly. For example, you could think of the standard PASSCAL data policy as a license. It is proprietary for 2 years after collection, but CC-0 after 2 years, so the license changes with time. Guess I am wondering if the right answer is to provide a way to link to the actual license policy instead of trying to embed it directly, so keep only the url and not have the abbreviation or any text? Then complex cases can be handled by the license holder instead of by stationxml? That would mean that responsibility for dealing with any conflicts or confusion is totally on the license holder, all we provide is a place to put the URL. Common license types, like CC-0 would all use the creative commons url, so it is easy to tell.
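
To show why a time-dependent license is awkward to express as static metadata, here is a small Python sketch of a rolling embargo. It assumes a simplified reading of the two-year rule described above (proprietary until exactly two years after collection, CC-0 afterwards) and is not a statement of the actual PASSCAL policy. The function names and the 1 March fallback for leap days are assumptions.

```python
from datetime import date

EMBARGO_YEARS = 2  # from the two-year example above


def embargo_end(collected):
    """First date on which data collected on `collected` is no longer proprietary."""
    try:
        return collected.replace(year=collected.year + EMBARGO_YEARS)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; assume 1 March.
        return date(collected.year + EMBARGO_YEARS, 3, 1)


def license_on(collected, today):
    """License label that applies on `today` to data collected on `collected`."""
    return "proprietary" if today < embargo_end(collected) else "CC-0"


print(license_on(date(2020, 6, 1), date(2021, 6, 1)))
print(license_on(date(2020, 6, 1), date(2022, 6, 1)))
```

The answer depends on both the collection date of each sample and the date of the request, so no single element value in a station-level document can capture it. That supports pointing to the policy by URL.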

Would two elements like WaveformDataLicense and StationXMLLicense help separate the license to use the waveforms, i.e. miniseed, from a potential license of the stationxml? I am leery of something like MetadataLicense, as one person's metadata is another's data, although documentation could perhaps help with that.

Although not a stationxml issue, the marking of the license really needs to also be on the actual data itself. Not sure if there is a way to standardize how to do this in miniseed2?

jschaeff commented 2 years ago

It appears that DataCite is not very complete for complex license management, see pages 27 and 28 of https://schema.datacite.org/meta/kernel-4.4/doc/DataCite-MetadataKernel_v4.4.pdf

But RDA came out with the Machine Actionable DMP standard, which allows fixed embargoes to be defined. It does not, however, allow rolling embargoes to be defined. https://github.com/RDA-DMP-Common/RDA-DMP-Common-Standard/blob/master/docs/FAQ.md#how-to-express-embargoes

With a simple URL to the license you would miss the machine-readable part. A rolling embargo could be modeled with special tags, although this is a bit cumbersome.

But in the end, I guess, what a machine must know is if the waveform data is open or restricted now. So maybe the current license is enough, as suggested by Chad.

WaveformDataLicense and StationXMLLicense are more explicit, I like it.

crotwell commented 2 years ago

The DataCite information is interesting; I propose that we reuse as much as possible of what they have created. They use <rights> elements contained in a <rightsList> instead of license, perhaps making it more flexible. So perhaps:

<WaveformRights rightsURI="https://creativecommons.org/share-your-work/public-domain/cc0/"
      rightsIdentifier="CC0" >
    Creative Commons - No Rights Reserved
</WaveformRights>

An alternative would be to use <Rights> but then add an attribute or subelement to specify the type of data it applies to, like waveform, stationxml, etc. This might give more flexibility in complex cases. Perhaps:

<Rights rightsURI="https://creativecommons.org/share-your-work/public-domain/cc0/"
      rightsIdentifier="CC0" appliesTo="WAVEFORM">
    Creative Commons - No Rights Reserved
</Rights>

Possible to add date ranges, or perhaps things like olderThan="P2Y", but I'm not sure how far down this rabbit hole we should go. Note also DataCite uses lower case rights but stationxml uses capitalized element names.
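
As a sketch of how a reader might consume the appliesTo variant, the Python below groups Rights elements by the kind of data they cover. The element and attribute names are the ones proposed above. The value "STATIONXML", the second Rights entry, and the function name rights_by_target are assumptions made for illustration, and the fragment omits the StationXML namespace to stay short.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment without namespace; the second Rights entry and the
# "STATIONXML" value are invented for this illustration.
SAMPLE = """<Network code="IU">
  <Rights rightsURI="https://creativecommons.org/share-your-work/public-domain/cc0/"
      rightsIdentifier="CC0" appliesTo="WAVEFORM">
    Creative Commons - No Rights Reserved
  </Rights>
  <Rights rightsURI="https://creativecommons.org/licenses/by/4.0/"
      rightsIdentifier="CC-BY" appliesTo="STATIONXML">
    Creative Commons Attribution 4.0
  </Rights>
</Network>"""


def rights_by_target(xml_text):
    """Group rightsIdentifier values by the appliesTo attribute."""
    grouped = {}
    for rights in ET.fromstring(xml_text).findall("Rights"):
        grouped.setdefault(rights.get("appliesTo"), []).append(rights.get("rightsIdentifier"))
    return grouped


grouped_rights = rights_by_target(SAMPLE)
print(grouped_rights)
```

A single element name with an appliesTo attribute keeps the parsing code the same for waveforms and metadata, at the cost of needing an agreed vocabulary for the attribute values.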