Validation of identifier

ThomasJejkal commented 6 years ago

Hi,

I'm submitting this issue in my role as co-chair of the RDA Research Data Repository Interoperability WG. We plan to use datacite to provide minimal, standardized metadata for our recommendation for an interoperable, BagIt-based exchange format for digital content between repository platforms.

One possible concern about using datacite, the necessity of a DOI, came up during one of our virtual meetings. As we are not focussing solely on published datasets the presence of a DOI cannot be guaranteed, probably for the majority of packages created according to our recommendations. Luckily, the datacite schema documentation states that if [...]one of the required properties is unavailable[...] one should [...]use one of the standard (machine‐recognizable) codes listed in Appendix 3[...] (see Section 2.3).

However, according to the XSD schema this seems not to apply to the identifier. That's why I wanted to ask if this is a bug/feature in the schema implementation or a misinterpretation/inaccuracy of the schema documentation?

Thanks in advance for the clarification.

Regards, Thomas

mfenner commented 6 years ago

Thanks @ThomasJejkal. The DOI is required for schema validation, because it is required for DOI registration. It is probably easiest to discuss your use case in more detail (e.g. via email or phone), as there are several paths forward from a DataCite perspective.

DataCite is involved in the EC-funded FREYA project that started in December, where we are discussing bagit (with a DataCite XML metadata file inside). See for example https://github.com/datacite/freya/issues/2

ghost commented 6 years ago

EZID's approach is to fill in the identifier itself (since the identifier is always known from the operation being performed). That way the user can leave it unspecified.
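
As an illustration of that idea (not EZID's actual code), a minimal Python sketch could fill the identifier into a submitted record before validation. The function name `fill_identifier` and the DOI `10.1234/example` are made up for this example.

```python
import xml.etree.ElementTree as ET

NS = "http://datacite.org/schema/kernel-4"
# Serialize the DataCite namespace as the default namespace (no prefix).
ET.register_namespace("", NS)


def fill_identifier(xml_text: str, identifier: str, identifier_type: str = "DOI") -> str:
    """Set (or create) the <identifier> element from a value the service already knows."""
    root = ET.fromstring(xml_text)
    element = root.find(f"{{{NS}}}identifier")
    if element is None:
        # The user left the identifier out entirely, so create it.
        element = ET.Element(f"{{{NS}}}identifier")
        root.insert(0, element)
    element.text = identifier
    element.set("identifierType", identifier_type)
    return ET.tostring(root, encoding="unicode")


# Hypothetical record submitted without an identifier.
record = f'<resource xmlns="{NS}"><titles><title>Some title</title></titles></resource>'
print(fill_identifier(record, "10.1234/example"))
```

This only works when the identifier is known from the operation being performed, which is not the case for content that has no identifier at all.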

RKrahl commented 6 years ago

@mfenner, I agree that for your use case, DOI registration, it is obvious that a DOI is required. The question is whether the DataCite standard may also be used for other cases.

The Research Data Repository Interoperability WG is concerned with standards for interoperability between different research data repository platforms. The goal is to make it easier to move data from one repository to another. We defined a package format based on BagIt for the transport of such data. Of course, we need to include metadata in the packages. We decided to use DataCite as the minimal metadata standard to describe the metadata that must be included in the package.

Note that the data moved from one repository to another may or may not have a DOI. So we must account for the case that data does not have a DOI in our package format. During the discussion, the concern has been raised that we cannot use DataCite because it has Identifier as a mandatory property, with DOI as the only allowed identifierType. From this it has been inferred that DataCite would require the data to have a DOI, which is not always given in our use case. If you look into the written DataCite standard, it reads in Section 2.3 DataCite Properties on page 10:

Table 3 provides a detailed description of the mandatory properties, which must be supplied with any initial metadata submission to DataCite, together with their sub‐properties. If one of the required properties is unavailable, please use one of the standard (machine‐recognizable) codes listed in Appendix 3, Table 11.

I.e., the standard values for unknown information in Appendix 3, Table 11 are allowed to be used for the mandatory properties listed in Table 3, which includes the Identifier property. That would mean that DataCite does not require the described resource to have a DOI, but only to state explicitly whether it has one and to provide it, if available. As a result, the following example would be valid DataCite metadata:

<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4/metadata.xsd">
    <identifier identifierType="DOI">(:unas)</identifier>
    <creators>
        <creator>
            <creatorName>Doe, John</creatorName>
        </creator>
    </creators>
    <titles>
        <title>Some title</title>
    </titles>
    <publisher>Lebesgue Institute for Strange Materials (LISM)</publisher>
    <publicationYear>2013</publicationYear>
    <resourceType resourceTypeGeneral="Dataset">Measured Data</resourceType>
    <alternateIdentifiers>
        <alternateIdentifier alternateIdentifierType="LISM data record number">2013-R-4711</alternateIdentifier>
    </alternateIdentifiers>
    <descriptions>
        <description descriptionType="Abstract">
            Some nice description of the content.
        </description>
    </descriptions>
</resource>

On the other hand, the XML Schema Definition provided at the DataCite web page defines the value type for the Identifier property as:

  <xs:simpleType name="doiType">
    <xs:restriction base="xs:token">
      <xs:pattern value="10\..+/.+"/>
    </xs:restriction>
  </xs:simpleType>

This excludes the standard values for unknown information.
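
The mismatch can be reproduced outside an XSD validator. The following is a rough Python sketch, assuming that `re.fullmatch` is a close enough stand-in for XSD pattern matching (XSD patterns must match the whole value); the DOI `10.1234/example` is invented for illustration.

```python
import re

# Pattern copied from the doiType definition quoted above.
DOI_PATTERN = re.compile(r"10\..+/.+")


def matches_doi_type(value: str) -> bool:
    """Return True if value would satisfy the doiType pattern."""
    # xs:token collapses whitespace before the pattern is applied.
    token = " ".join(value.split())
    # XSD patterns are implicitly anchored, hence fullmatch rather than search.
    return DOI_PATTERN.fullmatch(token) is not None


# "10.1234/example" is a made-up DOI; "(:unas)" is the code used in the example record.
for candidate in ["10.1234/example", "(:unas)"]:
    print(f"{candidate!r}: {matches_doi_type(candidate)}")
```

The made-up DOI passes and `(:unas)` does not, which is the contradiction with Section 2.3 described above.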

Now the questions are:

Given the contradiction between the text of the written DataCite standard and the XML Schema Definition, which of the two is correct?
Is DataCite intended to be useful also for use cases other than DOI registration?
Does DataCite require the described resource to have a DOI, or does DataCite intend to cover also use cases such as ours, where a resource might not have a DOI?

mfenner commented 6 years ago

@RKrahl we use the XSD to validate metadata on DOI registration, so that is the authoritative source, and the documentation needs to be updated.

I suggest you make one small change to the XSD, requiring an identifier, but not a string with a DOI pattern. But I don't see DataCite changing the XSD in that regard any time soon, as this breaks the current DOI registration workflow as implemented now. Feel free to reuse the XSD; like everything else from DataCite, there are no restrictions on reusing and changing it, as long as it is clear that this is not the official DataCite XSD, but a modification.

mfenner commented 6 years ago

@RKrahl DataCite is currently not intending to support other identifiers besides DataCite DOIs. The use of DOIs is highly integrated with the metadata schema, e.g. to have a central search index for all metadata, to use the handle infrastructure to resolve DOIs, and a commitment to long-term archiving.

RKrahl commented 6 years ago

Ok. It seems that there has been a misconception from my side about what DataCite is. I considered the DataCite schema as a metadata standard to be used for various purposes. (And I know many people in the community that use the DataCite schema in exactly this way.) From your replies, it seems that it is only intended to be the input format for the particular DOI registration service that DataCite provides. I did not intend to ask for any modification whatsoever in the workflows of the registration service. I was asking for a clarification in the DataCite metadata standard. If you don't intend it to be such a standard in the first place, it's understandable that you cannot deliver what I was asking for.

mfenner commented 6 years ago

I hope the DataCite schema is a community standard. But for me the identifier and metadata are tightly linked to each other for a number of reasons, and I can't follow the logic of separating them out.

It would certainly help me to understand under what circumstances someone would want to use the DataCite metadata schema not with DataCite DOIs, but with another identifier.

ghost commented 6 years ago

We've heard from people who just want to use it as a standard for describing resources, unrelated to DOIs or DataCite. Which is why EZID has extended the schema to allow it to be used with other identifier types as in, e.g., ark:/12345/xyz.

ThomasJejkal commented 6 years ago

Well, as pointed out, we in our RDA WG are focussing on recommendations for research data repository interoperability. We try to achieve this goal by using a BagIt-based approach for exchanging digital content (not only published content!) between different platforms.

In order to have some lowest common denominator, we were looking for a generic metadata standard which can be used as common ground to provide a small set of metadata that can be interpreted by any platform adopting our recommendations. As a result of a technology assessment, we found that datacite is used or at least planned to be used by several platforms. Therefore, we've decided to take up datacite for our purposes.

However, as there are plenty of platforms out there not supporting datacite as their internal metadata model, this typically means that some export tool has to map from the platform model into datacite in order to provide a bag that can then be imported by another platform, which eventually has to map datacite again into its own model.

As the set of mandatory elements in a datacite document is rather small, this is no problem, except if a DOI is mandatory: not all repositories use DOIs, and not all exchanged digital content is even eligible for a DOI, as it is not necessarily published and may never be published.
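
To make the export step concrete, here is a minimal Python sketch of such a mapping. The internal record layout (keys `title`, `creators`, `publisher`, `year`, `doi`) is purely hypothetical and stands in for whatever a given platform stores; the sample values are taken from the example record earlier in this thread.

```python
import xml.etree.ElementTree as ET

NS = "http://datacite.org/schema/kernel-4"
ET.register_namespace("", NS)


def q(tag: str) -> str:
    """Qualify a tag name with the DataCite kernel-4 namespace."""
    return f"{{{NS}}}{tag}"


def to_datacite(record: dict) -> ET.Element:
    """Map a (hypothetical) platform record onto the mandatory DataCite elements."""
    root = ET.Element(q("resource"))
    doi = record.get("doi")
    if doi:
        # Only content that actually has a DOI can fill this element honestly.
        ET.SubElement(root, q("identifier"), identifierType="DOI").text = doi
    creators = ET.SubElement(root, q("creators"))
    for name in record["creators"]:
        creator = ET.SubElement(creators, q("creator"))
        ET.SubElement(creator, q("creatorName")).text = name
    titles = ET.SubElement(root, q("titles"))
    ET.SubElement(titles, q("title")).text = record["title"]
    ET.SubElement(root, q("publisher")).text = record["publisher"]
    ET.SubElement(root, q("publicationYear")).text = str(record["year"])
    ET.SubElement(root, q("resourceType"), resourceTypeGeneral="Dataset").text = "Dataset"
    return root


# Unpublished content: no "doi" key, so no <identifier> element is produced.
record = {
    "title": "Some title",
    "creators": ["Doe, John"],
    "publisher": "Lebesgue Institute for Strange Materials (LISM)",
    "year": 2013,
}
print(ET.tostring(to_datacite(record), encoding="unicode"))
```

Every mandatory element except the identifier can be filled from the platform's own data, which is exactly where the mapping gets stuck for content without a DOI.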

RKrahl commented 6 years ago

@mfenner, I completely agree that the identifier is an important property in the metadata. I also agree that the DOI is a particularly useful type of identifier, which justifies its distinguished position in the DataCite schema. The only problem is that there is plenty of data in our repositories that simply does not have a DOI and will never have one, for various reasons. We also need to deal with these data. So, what can we do:

throw away all data not having a DOI? (Not an option)
relinquish providing any metadata for these data? (Not an option)
not using any metadata standard? (Not an option)
use a metadata standard that allows leaving the DOI out? (Currently preferred option)
any other idea?

(Again, we do not plan to register any of these data with DataCite or to bother you in any way with it. The question is only how we should deal with these data and whether we can use the DataCite schema as the metadata standard for our use case.)

mfenner commented 6 years ago

@ThomasJejkal, can you send me an email at mfenner@datacite.org? It might be easier to speak on the phone, as we are working on the same questions in the EC-funded FREYA project that just started. Including:

What are the core metadata for a dataset? The required DataCite metadata are for published datasets, and might not always be a good fit for unpublished data.
How can we make this work for other identifiers, specifically in the life sciences, which typically don't use DOIs for datasets? DataCite is working with identifiers.org on this. And we use schema.org metadata, which map nicely to DataCite metadata, and can be embedded into repository landing pages.
How can we package datasets so that they can easily be downloaded or archived? We are also looking at bagit.

Short version: consider schema.org instead of DataCite XML.

hvwaldow commented 6 years ago

The DataCite Schema has received a lot of adoption outside the immediate application of DOI registration because it is among the most practical and capable ones for describing scientific datasets. It would be a pity if this issue prevented further adoption, particularly because the whole enterprise suffers from a lack of consolidation.

Validation of the DOI in a DOI registration includes checks for the correct prefix and the uniqueness of a DOI. From a conceptual point of view, ignorant of the actual workflow implementation, it seems reasonable to also move the check of whether there is a DOI at all out of the XSD.

Alternatively, what about a fork for general applications that is otherwise kept in sync with the official schema?

mfenner commented 6 years ago

@ThomasJejkal and I discussed this on the phone yesterday and we will come up with a practical solution. There are several options, including a fork, but also using the schema without validation. One argument for the latter is that users might not only use a different identifier, but might also not use all the properties required by the current schema.

I agree that the DOI check in the schema is not ideal, because it is doing both too much (for the use case discussed here) and too little, as the validation allows many DOI names that would not be a good idea. But the practical implementation of this takes time, and the RDA working group is working on their recommendations now.

mfenner commented 6 years ago

Closing this after further discussion offline, and hopefully more discussion at the RDA Plenary in Berlin next week.

mfenner commented 5 years ago

Starting with the upcoming schema 4.2 we will no longer check the format of the identifier (<xs:pattern value="10\..+/.+"/>), or check that the identifierType is a DOI. We are already doing these checks in our API, and this will allow re-use of the metadata schema for other identifier types. The only requirement is that the identifier is not empty.
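
The relaxed rule described here can be approximated as follows. This is a Python sketch of the stated requirement only, not the 4.2 XSD or the DataCite API; the function name and the identifierType value `ARK` are assumptions, and the ARK value is the one mentioned earlier in this thread.

```python
import xml.etree.ElementTree as ET

URI = "http://datacite.org/schema/kernel-4"
NS = {"d": URI}


def identifier_is_acceptable(xml_text: str) -> bool:
    """Check only that an <identifier> element exists and has non-empty content."""
    root = ET.fromstring(xml_text)
    element = root.find("d:identifier", NS)
    # No DOI pattern and no fixed identifierType: any non-blank value passes.
    return element is not None and bool((element.text or "").strip())


# Hypothetical records: one with a non-DOI identifier, one with a blank identifier.
with_ark = f'<resource xmlns="{URI}"><identifier identifierType="ARK">ark:/12345/xyz</identifier></resource>'
blank = f'<resource xmlns="{URI}"><identifier identifierType="DOI"> </identifier></resource>'

print(identifier_is_acceptable(with_ark))
print(identifier_is_acceptable(blank))
```

Format checks such as the DOI pattern would then live in the registration service rather than in the schema.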

ThomasJejkal commented 5 years ago

Great news, thanks for the notification.

mfenner commented 5 years ago

We expect schema 4.2 to be released in March. The draft schema and documentation are here: https://schema.test.datacite.org/meta/kernel-4.2/

datacite / schema

Validation of identifier #43