EDIorg / data-package-best-practices

Best practices for data packages. A gh-pages website with sections on metadata concepts and aspects of data packaging.
https://ediorg.github.io/data-package-best-practices/

New Guidelines for Defining Data Package Replication #88

Open clnsmth opened 1 week ago

clnsmth commented 1 week ago

Hi everyone,

We're excited to announce the release of new guidelines for defining data package replication between EDI and other repositories. These guidelines offer solutions for describing replication at both the data package and data entity levels.

To learn more about the release and access the guidelines, please check out the following resources:

These guidelines may be good additions to the data packaging best practices.

Thanks!

twhiteaker commented 1 week ago

What if we're just replicating metadata? We've tried this before, when attempting to replicate metadata from EDI to the Arctic Data Center (ADC).

Option 1: Add a snippet of XML into additionalMetadata:

<additionalMetadata>
  <metadata>
    <d1v1:replicationPolicy xmlns:d1v1="http://ns.dataone.org/service/types/v1"
        numberReplicas="1" replicationAllowed="true">
      <preferredMemberNode>urn:node:ARCTIC</preferredMemberNode>
    </d1v1:replicationPolicy>
  </metadata>
</additionalMetadata>
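
For anyone scripting Option 1, here is a minimal Python sketch of appending that same block to a parsed EML tree with the standard library. The function name `add_replication_policy` and the stripped-down `<eml>` document are hypothetical stand-ins, not part of any EDI or DataONE tooling, and the sketch only shows how the snippet is assembled. It does not make the replication policy take effect.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the snippet above; the prefix is registered so the
# serialized output uses "d1v1:" instead of an auto-generated prefix.
D1_NS = "http://ns.dataone.org/service/types/v1"
ET.register_namespace("d1v1", D1_NS)


def add_replication_policy(eml_root, member_node="urn:node:ARCTIC", replicas=1):
    """Append an additionalMetadata block holding a DataONE replicationPolicy.

    Hypothetical helper for illustration only.
    """
    additional = ET.SubElement(eml_root, "additionalMetadata")
    metadata = ET.SubElement(additional, "metadata")
    policy = ET.SubElement(
        metadata,
        f"{{{D1_NS}}}replicationPolicy",
        {"numberReplicas": str(replicas), "replicationAllowed": "true"},
    )
    # Child element is unqualified, as in the snippet above.
    ET.SubElement(policy, "preferredMemberNode").text = member_node
    return additional


# Assumption: a simplified stand-in for a real EML document root.
eml = ET.fromstring(
    '<eml packageId="example.1.1"><dataset><title>Example</title></dataset></eml>'
)
add_replication_policy(eml)
snippet = ET.tostring(eml, encoding="unicode")
print(snippet)
```

Note that `ElementTree` writes the `xmlns:d1v1` declaration on the root element rather than on `replicationPolicy` itself. The two forms are equivalent XML.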

Option 2: manual process.

  1. The dataset must be synced and indexed at search.dataone.org. If you search for "knb-lter-ble" and find the dataset, then it is indexed. Syncing is something EDI manages, but sometimes the process lags, so if you notice something isn't synced after a couple of weeks, contact EDI to see what's going on.
  2. Once the dataset is synced to DataONE, the BLE information manager must provide the DOI of the dataset to ADC so they can harvest the metadata.

Option 1 hasn't worked for a few years, and a manual process like Option 2 isn't ideal. Should we just replicate the whole dataset? Or is there a way to replicate only the metadata using semantic annotation?

clnsmth commented 11 hours ago

Thanks for your questions, @twhiteaker.

Regarding replicating metadata, I don’t believe this is currently possible because there’s no "subject" element in the EML record that a semantic annotation could use to reference the record itself. However, it’s possible that I may have missed something.

The second challenge is identifying a suitable "object" to reference. One option is using the URL of the metadata record, but this is less than ideal since URLs can change. Ideally, there would be a DOI for the metadata record that could be referenced, which would provide a more stable identifier. This issue is similar to describing entity-level replication within the EDI repository. Since we don’t assign DOIs to individual data entities, the best we can do is reference the data entity’s URL (as seen here).

Even if we overcame the above issues, we’d still face a "chicken and egg" problem: the user would need to know the DOI of the data package before it’s published in order to assign it in the metadata using the schema:sameAs annotation to reference itself. Since this isn’t possible from EDI’s side, the destination repository could add a sameAs reference to the replicated content it hosts. That said, perhaps ADC handles this differently? For example, the “Data Set Publishers” field on their data package landing page lists EDI as the publisher of the content.
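
To make the "chicken and egg" point concrete, here is a rough Python sketch of what attaching a schema:sameAs annotation to a dataset element could look like. It assumes the EML 2.2 `annotation` structure (`propertyURI` plus `valueURI`, with the parent element's `id` acting as the subject). The helper name `annotate_same_as`, the label text, and the DOI are hypothetical placeholders, and the sketch ignores the element ordering that the EML schema requires. The key detail is that `replica_uri` has to be known before the metadata is written.

```python
import xml.etree.ElementTree as ET

SAME_AS = "https://schema.org/sameAs"


def annotate_same_as(dataset, replica_uri):
    """Attach a schema:sameAs annotation to a dataset element.

    Hypothetical helper; the annotation's subject is the parent element,
    so the parent is required to carry an id attribute.
    """
    if dataset.get("id") is None:
        raise ValueError("annotation needs a parent element with an id")
    annotation = ET.SubElement(dataset, "annotation")
    prop = ET.SubElement(annotation, "propertyURI", {"label": "same as"})
    prop.text = SAME_AS
    value = ET.SubElement(annotation, "valueURI", {"label": replica_uri})
    value.text = replica_uri
    return annotation


# Assumption: a stripped-down dataset element and a made-up DOI.
dataset = ET.fromstring('<dataset id="dataset-1"><title>Example</title></dataset>')
annotate_same_as(dataset, "https://doi.org/10.0000/hypothetical-replica")
print(ET.tostring(dataset, encoding="unicode"))
```

Because the identifier must exist before the function can be called, the repository that receives the replica is the one in a position to add the annotation.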

As for whether there’s a semantic annotation mechanism to facilitate data replication: no, not at this time. The methods you mentioned are the only ones we’re aware of.