PASTAplus / PASTA

Repository for the Provenance Aware Synthesis Tracking Architecture (PASTA) project.

Link to dataset published in another repository #56

Open vanderbi opened 3 years ago

vanderbi commented 3 years ago

The use case is that a scientist has published data in the KNB, but it is an LTER dataset and I need to be able to link to the data from EDI without giving it a new DOI. FCE generates its data catalog by harvesting metadata from EDI.

See here for PEP: https://docs.google.com/document/d/1kcTn9y18KaTa9V_NVlwvdvX6gZyEzU9knmfOoLy1h34/edit?usp=sharing

servilla commented 2 years ago

Major considerations:

  1. What would the EML look like to describe a data package in another system? Are there required elements beyond the minimal set required for EML to be valid? How would the ECC process respond to such an EML document?
  2. If a data package is created as linked, what are the implications for an update? If a data package is created as linked, would that mean any update must also be linked, or could PASTA hold the data for an updated version of the data package? Or conversely, could an update to a local data package become linked?
  3. A linked data package may not be a candidate for a PASTA DOI if the authoritative repository has already assigned and registered a DOI for it. PASTA's create and update data package APIs should accept a query parameter to suppress the assignment of a PASTA DOI. What about the DOI scanner process: how would a data package that is registered in PASTA be flagged so that the DOI scanner does not attempt to assign a DOI to one of these linked data packages? This use case may also apply to data packages other than those linked to other repositories.
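
To make item 3 concrete, here is a minimal sketch of how a DOI opt-out could work. Everything in it is an assumption for illustration, not PASTA's actual API: the `assignDoi` query parameter, the `assign_doi` flag on the stored record, the `PackageRecord` type, and the placeholder host are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlencode


@dataclass
class PackageRecord:
    """Hypothetical registry row for a data package."""
    package_id: str
    doi: Optional[str] = None
    assign_doi: bool = True  # hypothetical flag; False means "never mint a DOI here"


def build_create_url(base_url: str, assign_doi: bool = True) -> str:
    """Build a create/update URL, adding the hypothetical opt-out parameter."""
    if assign_doi:
        return base_url
    return f"{base_url}?{urlencode({'assignDoi': 'false'})}"


def packages_needing_doi(records: List[PackageRecord]) -> List[PackageRecord]:
    """What a DOI scanner might select: no DOI yet and not opted out."""
    return [r for r in records if r.doi is None and r.assign_doi]


if __name__ == "__main__":
    # Placeholder host, not a real PASTA endpoint.
    print(build_create_url("https://pasta.example.org/package/eml", assign_doi=False))
    registry = [
        PackageRecord("local.1.1"),                     # local package, needs a DOI
        PackageRecord("linked.2.1", assign_doi=False),  # linked package, opted out
    ]
    print([r.package_id for r in packages_needing_doi(registry)])
```

Persisting the flag with the package record, rather than relying only on the request parameter, is what would let the scanner skip opted-out packages on later runs.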

mobb commented 2 years ago

Here is a spreadsheet that describes some cases where a dataset might have no data entities ("metadata-only"). It might help define these cases. https://docs.google.com/spreadsheets/d/1eA4-ggvJ36J4LIPNnQom_XW8rrHe8ni6hJCUlRGGQxA/edit#gid=0

mbjones commented 2 years ago

FYI, we've been having discussions about these external data linkages with ESS-DIVE and Arctic Data Center and how to do them consistently across repositories using EML. If you're interested, maybe this would be a good topic to discuss across groups? ESS-DIVE has a detailed proposal for using annotations with DataCite related identifier properties, but unfortunately they work in a private repository so the ticket isn't publicly visible. But I could talk to @vchendrix about sharing if it is of interest.

mbjones commented 2 years ago

Also, @vanderbi, we have to deal with this same issue a lot with the Arctic Data Center, where researchers have their data in one repo (e.g., KNB, LTER, BCO-DMO) but are required by NSF policy to have a copy in the Arctic Data Center. Our preferred approach in that case is to replicate the data package from, e.g., EDI to ADC, keeping the exact same contents and identifiers. So it shows up as a replica in the ADC with the DOI and citation from LTER/EDI, and DataONE knows it's the same dataset, so it only shows it once. Can PASTA replicate the dataset from the KNB exactly and keep track that it is a replica?

This is a slightly different case than the metadata-only record that Margaret raised for linking to data files externally. @vchendrix has some other use cases for external linking outlined as well.

servilla commented 2 years ago

Hi @mbjones - yes, these are important and forefront issues that would be best discussed as a community so that we have a consistent and unified solution. PASTA cannot handle the linked-data use case (i.e., replicate from KNB and track as such) in its present state, but we believe there is a workable solution that can be implemented without too much effort (TBD). That being said, I think a group discussion would be valuable - count me (EDI) in.

mbjones commented 2 years ago

That's great, @servilla -- I think a DataONE Community call would be a great place for this discussion. Would that be ok? Just in case, I started a session description for brainstorming here: https://github.com/DataONEorg/community-calls/issues/16 Please contribute ideas and thoughts on what we might discuss and who might be involved -- I threw down a few initial reactions, but feel free to change/elaborate.

gkamener commented 4 months ago

@servilla, this feature would be fantastic to see in EDI! Is it still scheduled for development?

servilla commented 4 months ago

@gkamener, this feature is still being worked on as we refine our use of annotations. We expect to have a recommendation in early summer 2024. Sorry that it is slow to arrive.