DataONEorg / community-calls

Ideas and topics for DataONE Community Calls

approaches to cross-repository dataset replication and linking #16

Open mbjones opened 2 years ago

mbjones commented 2 years ago

Topic Description

Repositories frequently need both to replicate datasets that are held in other repositories (for policy, availability, and other reasons) and to link to external datasets that are sometimes the same dataset and sometimes a related one. For example, at the Arctic Data Center, we frequently need to replicate datasets from EDI to meet NSF policy guidelines, so we have worked out a streamlined workflow to make sure that researchers do not have to enter their metadata or data twice. Being able to generalize these capabilities across the network could increase efficiency and reduce duplication.

Relationship to DataONE?

While DataONE provides some mechanisms for replication across repositories, including identifier deduplication and other services, there are other use cases that have not yet been covered, such as how to link to data files stored in other locations, or to non-exact replicas on other repositories.

Possible speakers

We try to keep speaking to a short 1-15 minutes total so that 45 minutes can be used for discussion on the community calls. Maybe we could pick one or two of the following folks to introduce the topic and get the conversation started?

Related resources and links

servilla commented 2 years ago

Hi @mbjones - I would be glad to introduce the topic from EDI's perspective. Do you have any tentative dates for this call?

mbjones commented 2 years ago

Hi @servilla the community calls are typically held monthly on the first Thursday of each month (see details: https://www.dataone.org/community-calls/), but we've been on hiatus for the summer, and hadn't resumed yet. @karlbenedict helps with scheduling, but I think we could do it soonish, possibly even Oct 7 if the folks contributing wanted to do so.

karlbenedict commented 2 years ago

October 7 may be a bit soon to be able to get the word out as we are only a week out from then. Depending on the status of our currently planned November topic we could more reasonably do this one in November or December.

Thanks, Karl

Karl Benedict Professor, Director of Research Data Services / Director of IT College of University Libraries and Learning Sciences University of New Mexico

Office: Centennial Science and Engineering Library, Room L173


mbjones commented 2 years ago

Yeah, it is very soon. But that might also be ok if we only had a smallish group for the discussion, while still being open to anyone attending. Or we could wait until November if there isn't a rush on this -- it was triggered by specific needs at EDI, so hopefully folks there can weigh in on the timing.

servilla commented 2 years ago

November would work better for me - gives me more time to better understand any code changes we'll be in for.

Thanks, Mark



vchendrix commented 2 years ago

We have been thinking about this on ESS-DIVE and have a few initial use cases we are trying to support in the near term. I have forwarded this community call to my colleague who has been thinking about this for a while.

jeanetteclark commented 2 years ago

Happy to present for the Arctic Data Center - we often need to do cross repo linking/replication.

vchendrix commented 2 years ago

ESS-DIVE external linking (under review)

The following is the current work under review for ESS-DIVE. (ESS-DIVE: @JEDamerow @shreddd)

The ability to provide a link to data file(s) outside of ESS-DIVE. Instead of uploading data files to a dataset, the user could provide a link to the data along with metadata about the data package being linked to. Our initial use cases will support the ability to link out to (meta)data at other repositories or ESS-DIVE Tier 2 storage.

Use Cases

  1. External link to data file(s) distributed as part of the dataset
  2. External link to a complete copy of the data in the dataset
  3. External link to original publication of dataset where metadata and data can be found.

ESS-DIVE uses EML as the underlying metadata format and also has a REST API which translates ESS-DIVE generated EML to JSON-LD. Thus, there is a requirement to be able to translate external linking in both JSON-LD and EML. We have had conversations with several folks internal and external to ESS-DIVE and gone through several iterations (6+) of thought exercises on ways to capture external linking in both metadata formats that allows for a smooth translation between the two formats. I will not go over these iterations. The following is a description of our current thinking.

Our current iteration (close to final, pending team review) is to use schema.org metadata to express the three use cases mentioned above in both metadata formats (EML, JSON-LD).

We will use EML annotations to create a semantic triple ([subject] [predicate] [object]).
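As a rough illustration of the triple idea, here is a minimal Python sketch that reads dataset-level annotations and yields (subject, predicate, object) tuples. It assumes a simplified, un-namespaced EML fragment and uses the dataset `id` attribute as the subject; the `annotation_triples` helper and the `example-dataset` identifier are hypothetical and not part of ESS-DIVE's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified EML fragment. Real EML documents are namespaced
# and contain many more elements than shown here.
EML = """
<dataset id="example-dataset">
  <annotation>
    <propertyURI label="has part">
      https://schema.org/hasPart
    </propertyURI>
    <valueURI label="orthomosaiced estimated reflectance data">
      https://portal.nersc.gov/wfsfa/doi-10-15485-16181314/
    </valueURI>
  </annotation>
</dataset>
"""


def annotation_triples(dataset):
    """Yield (subject, predicate, object) for each direct <annotation> child."""
    subject = dataset.get("id")
    for annotation in dataset.findall("annotation"):
        # The URIs are wrapped in whitespace in the EML, so strip them.
        predicate = annotation.findtext("propertyURI", default="").strip()
        obj = annotation.findtext("valueURI", default="").strip()
        yield (subject, predicate, obj)


triples = list(annotation_triples(ET.fromstring(EML)))
print(triples)
```

The dataset element is the subject, `propertyURI` the predicate, and `valueURI` the object.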

Use Case 1: External link to data file(s) distributed as part of the dataset

One or more files that are part of a data package reside outside of the main archive. This could be a link to an individual file or a directory.

EML (Annotation on the dataset)

Uses schema.org vocabulary to describe the external links. In this case, "Dataset has part orthomosaiced estimated reflectance data". The inverse would be "orthomosaiced estimated reflectance data is part of the Dataset".

<dataset id="<identifier>">
...
<annotation>
    <propertyURI label="has part">
        https://schema.org/hasPart
    </propertyURI>
    <valueURI label="orthomosaiced estimated reflectance data">
        https://portal.nersc.gov/wfsfa/doi-10-15485-16181314/
    </valueURI>
</annotation>
...
</dataset>
...

JSON

This use case translates to schema.org hasPart

{
  "@type": "Dataset",
  "hasPart": {
    "@type": "WebPage",
    "name": "orthomosaiced estimated reflectance data",
    "url": "https://portal.nersc.gov/wfsfa/doi-10-15485-16181314/"
  }
}
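To make the EML-to-JSON translation concrete, here is a minimal Python sketch. The `PROPERTY_KEYS` table and the `annotation_to_jsonld` function are hypothetical names used only for illustration; the sketch assumes every linked resource is rendered as a schema.org WebPage, as in the example above.

```python
import json

# Assumed lookup from annotation property URIs to schema.org JSON keys.
PROPERTY_KEYS = {
    "https://schema.org/hasPart": "hasPart",
    "https://schema.org/archivedAt": "archivedAt",
}


def annotation_to_jsonld(property_uri, value_uri, label):
    """Translate one EML annotation (predicate, object, label) to JSON-LD."""
    key = PROPERTY_KEYS[property_uri.strip()]
    return {
        "@type": "Dataset",
        key: {
            "@type": "WebPage",
            "name": label,
            "url": value_uri.strip(),
        },
    }


doc = annotation_to_jsonld(
    "https://schema.org/hasPart",
    "https://portal.nersc.gov/wfsfa/doi-10-15485-16181314/",
    "orthomosaiced estimated reflectance data",
)
print(json.dumps(doc, indent=2))
```

The annotation's label becomes the WebPage name, so the same text is shown to the user in both formats.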

ESS-DIVE UI Example

[Screenshot: ESS-DIVE UI example, 2021-10-05 8:50:21 AM]

Use Case 2: External link to a complete copy of the data in the dataset

Another complete copy of the data in the data package resides outside of the main archive.

EML annotations

This use case uses the schema.org archivedAt property to describe the relationship. In this case, "Dataset archived at Globus Copy at NERSC".

<dataset>
...
<annotation>
    <propertyURI label="archived at">
        https://schema.org/archivedAt
    </propertyURI>
    <valueURI label="Globus Copy at NERSC">
        https://app.globus.org/file-manager?origin_id=211394dc-e1a0-11ea-9ef9-0aba3c43875b&amp;origin_path=%2Fdoi-10-15486-ngt-1770776%2F
    </valueURI>
</annotation>
...
</dataset>

JSON

This use case translates to schema.org archivedAt which is pending implementation feedback and adoption from applications and websites.

{
  "@type": "Dataset",
  "archivedAt": {
    "@type": "WebPage",
    "name": "Globus Copy at NERSC",
    "url": "https://app.globus.org/file-manager?origin_id=211394dc-e1a0-11ea-9ef9-0aba3c43875b&origin_path=%2Fdoi-10-15486-ngt-1770776%2F"
  }
}
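Going in the other direction (JSON back to EML) has one practical wrinkle: URLs with query strings contain `&`, which has to be escaped in XML. Here is a minimal Python sketch, assuming a hypothetical `link_to_annotation` helper that reuses the schema.org key as the label. The standard library serializer does the escaping, and parsing the result returns the original URL.

```python
import xml.etree.ElementTree as ET


def link_to_annotation(key, link):
    """Build an <annotation> element from a schema.org key and a WebPage link."""
    annotation = ET.Element("annotation")
    # Simplification: the key itself is used as the human-readable label.
    prop = ET.SubElement(annotation, "propertyURI", label=key)
    prop.text = "https://schema.org/" + key
    value = ET.SubElement(annotation, "valueURI", label=link["name"])
    value.text = link["url"]
    return annotation


link = {
    "@type": "WebPage",
    "name": "Globus Copy at NERSC",
    "url": ("https://app.globus.org/file-manager"
            "?origin_id=211394dc-e1a0-11ea-9ef9-0aba3c43875b"
            "&origin_path=%2Fdoi-10-15486-ngt-1770776%2F"),
}

# Serializing escapes "&" as "&amp;" so the fragment is well-formed XML.
xml_text = ET.tostring(link_to_annotation("archivedAt", link), encoding="unicode")
print(xml_text)

# Parsing the fragment again recovers the unescaped URL.
round_trip = ET.fromstring(xml_text).findtext("valueURI")
```

Building the element with an XML library rather than string concatenation avoids having to escape these characters by hand.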

Use Case 3: External link to original publication of dataset where metadata and data can be found

The original landing page where the data can be found.

EML Annotations

This use case translates to schema.org sameAs and identifier.
In this case, "Dataset same as https://doi.org/10.25581/spruce.048/1425889".

<dataset>
...
<annotation>
    <propertyURI label="sameAs">
        https://schema.org/sameAs
    </propertyURI>
    <valueURI label="doi:10.25581/spruce.048/1425889">
        https://doi.org/10.25581/spruce.048/1425889
    </valueURI>
</annotation>
...
</dataset>

JSON

This use case translates to schema.org sameAs and identifier

{
  "@type": "Dataset",
  "identifier": {
    "@type": "PropertyValue",
    "propertyID": "DOI",
    "value": "10.25581/spruce.048/1425889"
  },
  "sameAs": "https://dx.doi.org/10.25581/spruce.048/1425889"
}
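A minimal Python sketch of how the identifier and sameAs values could both be derived from the single URL in the EML annotation. The `doi_link` function is hypothetical, and it assumes the annotation value always uses the `https://doi.org/` resolver prefix.

```python
DOI_PREFIX = "https://doi.org/"


def doi_link(doi_url):
    """Derive schema.org identifier and sameAs entries from a DOI URL."""
    doi_url = doi_url.strip()
    if not doi_url.startswith(DOI_PREFIX):
        raise ValueError("expected a URL starting with " + DOI_PREFIX)
    # Everything after the resolver prefix is the bare DOI.
    doi = doi_url[len(DOI_PREFIX):]
    return {
        "@type": "Dataset",
        "identifier": {
            "@type": "PropertyValue",
            "propertyID": "DOI",
            "value": doi,
        },
        "sameAs": doi_url,
    }


doc = doi_link("https://doi.org/10.25581/spruce.048/1425889")
print(doc["identifier"]["value"])
```

Rejecting anything that does not start with the resolver prefix catches malformed values such as a missing `//` after the scheme.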

ESS-DIVE UI Example

[Screenshot: ESS-DIVE UI example, 2021-10-05 8:52:14 AM]

Future work

Ability to link precisely to related resources, which are important for interpretation, search, access, integration, and reuse - particularly for interdisciplinary data. This could include related datasets, sample metadata, sample data, methods/protocols, and the paper associated with a dataset.

For this we will explore the use of the DataCite metadata schema relationType from relatedIdentifiers with EML annotations. In JSON-LD, we will experiment with mapping the DataCite vocabulary in @context.
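As a very rough sketch of what mapping a DataCite relation type in @context might look like, here is a small Python example. The `datacite` namespace IRI and the related-resource URL are placeholders (assumptions, not official IRIs), and `expand` is a hypothetical helper that only imitates a small part of JSON-LD term expansion. `IsSupplementTo` is one of the DataCite relationType values.

```python
import json

# Placeholder namespace IRI: which IRI to use for DataCite relation types
# is an open question, so this value is an assumption for illustration only.
DATACITE_NS = "https://example.org/datacite/relationType#"

doc = {
    "@context": {
        "@vocab": "https://schema.org/",
        "datacite": DATACITE_NS,
    },
    "@type": "Dataset",
    # Placeholder URL for the related resource.
    "datacite:IsSupplementTo": {"@id": "https://example.org/related-paper"},
}


def expand(term, context):
    """Expand a compact term to a full IRI (simplified, not full JSON-LD)."""
    prefix, _, local = term.partition(":")
    if local and prefix in context:
        return context[prefix] + local
    return context.get("@vocab", "") + term


print(json.dumps(doc, indent=2))
print(expand("datacite:IsSupplementTo", doc["@context"]))
```

Unprefixed terms still resolve against schema.org through @vocab, so the existing hasPart, archivedAt, and sameAs keys would not need to change.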

aebudden commented 2 years ago

It seems that November is preferable and more reasonable given the October date is tomorrow. The first Thursday would be November 4th. We hold these at either 1000 Pacific or 1700 Pacific - alternating between the two. Unfortunately, Matt, Jeanette and I are running a training activity all that week. Given conflicts, and the summer break, do we want to find a different date vs waiting until December?

JEDamerow commented 2 years ago

Any update on when this may take place?

aebudden commented 2 years ago

Scheduling for Wednesday Nov 10th 1700 UTC @JEDamerow @jeanetteclark @servilla - does that work for you?

jeanetteclark commented 2 years ago

@aebudden I'll be on vacation that day

mbjones commented 2 years ago

We discussed this in the Arctic Data Center call today, and proposed that Natasha Haycock-Chavez (@nhchavez) present instead of Jeanette for the Arctic Data Center. She agreed to work with @jeanetteclark and me on it, and she can give a nice intro to the topic generally, as well as how we have handled replication and linking at the ADC. On the TT call today, @karlbenedict agreed to update the website with the new bio info. We also need to confirm if @JEDamerow or someone from ESS-DIVE would be able to help with the framing of the space.

With all of this, we need to keep in mind that the speaking part of the session should take up a total of no more than 20 minutes, so that the majority of the session is available for structured discussion. So that probably means max 5 minutes each to frame the discussion.

mbjones commented 2 years ago

@nhchavez this is the issue discussing the community call... thanks.

JEDamerow commented 2 years ago

Scheduling for Wednesday Nov 10th 1700 UTC @JEDamerow @jeanetteclark @servilla - does that work for you?

That works for me.

aebudden commented 2 years ago

Zoom line for the community call tomorrow: https://ucsb.zoom.us/j/94309556242

Hack pad for notes: https://hackmd.io/EKi9azkVTzW2FzmsZPjIgw

See you at 1700 UTC, 0900 Pacific