CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

Sunset Merritt Member Node #437

Open marisastrong opened 4 years ago

marisastrong commented 4 years ago

To turn off the Merritt/CDL Member Node: the https://merritt.cdlib.org/m/ucd_ice_swap collection is stored on Node 5001 (S3), so we already serve it up from there.

Content:

Infrastructure:

mreyescdl commented 4 years ago

Archiving of DataONE Merritt MN complete. https://merritt.cdlib.org/m/ark%253A%252F13030%252Fm5z94qw1

mreyescdl commented 4 years ago

Removed all DataONE references from profiles (Stage and Production)

Note: The following collections seem to be specific to DataONE. Are these still needed @elopatin @marisastrong?

STAGE:
oneshare_ark_only
oneshare_ark_only.orig
oneshare_dataup_content
oneshare_dataup_content.orig
dataone_dash_content
dataone_dcxl_content
dataone_demo_content

PROD:
dataone_dash_content
demo_dataone_content
oneshare_dataup_content

elopatin-uc3 commented 4 years ago

Thanks for the update @mreyescdl.

Stage:

Prod: @marisastrong please chime in on these (and Stage). I'm only noting LDAP and object status:

Based on Stage results, I'm comfortable with removing all of these collections except dataone_dash_content (check in with Scott). With the exception of demo_dataone_content, I assume we'll want to hang onto the production collections for a little while.

marisastrong commented 4 years ago

My understanding is that collections with the _content suffix are how collections are referred to within Merritt, but how they are referred to elsewhere is unknown to me. Given that the following

have dataone_dash_submitter associated with them and have seen fairly recent activity, I would keep the production and stage collections around until it's understood how we are referring to them.

marisastrong commented 3 years ago

Here are some notes from Matt Jones/DataONE on the steps that need to occur before we decommission the Member Node hosted at CDL:

As DataONE is concerned about long-term access to data and persistence of published identifiers, we don't delete the old records and associated identifiers, so if people cited or bookmarked a particular identifier, it will continue to resolve. What this means for the repo is that best practice would involve:

  1. Make a copy of all existing objects from Merritt to Dryad, maintaining the same identifiers and object checksums.
  2. For each object, change the authoritativeMN field to point to Dryad, which gives Dryad admin control over everything.
  3. If you plan to replace the existing data and metadata with new versions that follow Dryad metadata practices, mark the new data and metadata objects as replacements of the old ones (via the obsoletes field), so that DataONE knows that you have published new versions and will surface only the new versions in search.

At that point, we can mark the old Merritt node as down, and any new requests for that data would point at Dryad.
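
For reference, a minimal Python sketch of step 3, assuming the DataONE v2 CN REST API (CNCore.setObsoletedBy maps to PUT /obsoletedBy/{pid} per the DataONE architecture docs) and an authorized client certificate. The PID pairs and certificate paths are placeholders; the real migration would use whatever tooling DataONE recommends:

    import re
    from urllib.parse import quote
    import requests

    CN_BASE = "https://cn.dataone.org/cn/v2"
    CERT = ("client-cert.pem", "client-key.pem")  # placeholder admin credentials

    def serial_version(pid):
        # Read serialVersion out of the object's current system metadata.
        xml = requests.get(f"{CN_BASE}/meta/{quote(pid, safe='')}", cert=CERT).text
        return re.search(r"<serialVersion>(\d+)</serialVersion>", xml).group(1)

    def set_obsoleted_by(old_pid, new_pid):
        # CNCore.setObsoletedBy: marks old_pid as obsoleted by new_pid.
        resp = requests.put(
            f"{CN_BASE}/obsoletedBy/{quote(old_pid, safe='')}",
            files={"obsoletedByPid": (None, new_pid),
                   "serialVersion": (None, serial_version(old_pid))},
            cert=CERT,
        )
        resp.raise_for_status()

    # Hypothetical old-to-new pairs; the real list would cover every object.
    for old, new in [("<old Merritt PID>", "<replacement Dryad PID>")]:
        set_obsoleted_by(old, new)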

Dryad itself is not currently being harvested by DataONE after the switch off of DSpace; harvesting of Dryad via its schema.org entries is not yet working. Dryad is now publishing the schema.org metadata, but DataONE still needs to implement the harvest. Dryad also has this same issue of correctly obsoleting the old content.

marisastrong commented 3 years ago

Dryad has updated everything for DataONE to start harvesting from us. The work now needs to be prioritized on DataONE's end to finalize it.

marisastrong commented 3 years ago

Met with Daniella and Eric to discuss next steps. The plan is to zip up all files in the ICE collection, along with a mapping file listing the ARKs contained in the zip file and brief informational text describing the contents and noting that all content is still preserved in the Merritt repository. This zip file will be deposited into Dryad and issued a DOI.
EZID will update all the ARKs in the collection to resolve to the Dryad DOI.
Once the Dryad MN begins harvesting again, any existing ARKs cited or bookmarked will resolve to the DOI containing all ICE objects.
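
A minimal sketch of that packaging step with Python's standard library, assuming the per-ARK object directories have already been downloaded and extracted; all file and directory names here are illustrative:

    import csv
    import zipfile
    from pathlib import Path

    SOURCE = Path("ice_arks")  # one extracted directory per ARK (illustrative)

    with zipfile.ZipFile("ice_collection.zip", "w", zipfile.ZIP_DEFLATED) as bundle:
        # Mapping file: one row per file, keyed by the ARK it belongs to.
        with open("ark_mapping.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["ark", "path_in_zip"])
            for ark_dir in sorted(SOURCE.iterdir()):
                # Reconstruct "ark:/13030/..." from a directory name like
                # "ark_13030_m5765f2s" (assumed naming convention).
                ark = ark_dir.name.replace("ark_", "ark:/", 1).replace("_", "/")
                for member in sorted(ark_dir.rglob("*")):
                    if member.is_file():
                        bundle.write(member, member.relative_to(SOURCE))
                        writer.writerow([ark, str(member.relative_to(SOURCE))])
        bundle.write("ark_mapping.csv")  # include the mapping file itself
        bundle.write("README.txt")       # brief informational text described above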

marisastrong commented 3 years ago

The content for the datasets is served up from the member node itself. So if the member node is taken down, the coordinating node at DataONE would not be able to serve up the content.

The content / objects that will be deposited into Dryad should provide the mechanism for serving that content up to the coordinating node.

elopatin-uc3 commented 3 years ago

Script to download all Davis ICE objects is now in place and ready to run in four batches of ARKs. It's up on the second Docker dev box, here: ingest-stg-shared/dataone/ice_arks

I will kick off the first batch tomorrow morning. These will be post-processed with a new routine in Terry's File Analyzer:
https://confluence.ucop.edu/display/~tbrady/UC+Davis+Object+Prep
https://confluence.ucop.edu/pages/viewpage.action?spaceKey=~tbrady&title=File+Analyzer+for+Iterative+Metadata+Preparation
https://confluence.ucop.edu/display/~tbrady/Run+X11+File+Analyzer
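
For the record, a hypothetical stand-in for the download script (the real one lives on the dev box at the path above); the Merritt endpoint in URL_TEMPLATE is a placeholder, not an actual API path:

    import sys
    from pathlib import Path
    from urllib.parse import quote
    import requests

    URL_TEMPLATE = "https://merritt.cdlib.org/<object-download-endpoint>/{ark}"  # placeholder

    def fetch_batch(ark_list, out_dir):
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        for ark in Path(ark_list).read_text().split():
            target = out / (ark.replace("ark:/", "ark_").replace("/", "_") + ".zip")
            if target.exists():
                continue  # resume-friendly: the dev box shuts down at 7pm
            resp = requests.get(URL_TEMPLATE.format(ark=quote(ark, safe="")), stream=True)
            resp.raise_for_status()
            with open(target, "wb") as f:
                for chunk in resp.iter_content(1 << 20):
                    f.write(chunk)

    if __name__ == "__main__":
        fetch_batch(sys.argv[1], sys.argv[2])  # e.g. batch1.txt downloads/batch1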

elopatin-uc3 commented 3 years ago

First batch is downloading now.

elopatin-uc3 commented 3 years ago

All 4204 zips from first batch downloaded successfully. I'll start the second batch tomorrow morning, as it probably wouldn't complete by 7pm, when the dev box will automatically shut down.

elopatin-uc3 commented 3 years ago

Batch 2 done. I'll start the third tomorrow morning.

elopatin-uc3 commented 3 years ago

Batch 3 done.

elopatin-uc3 commented 3 years ago

Batch 4 done. We now have all 16,804 objects downloaded to ingest-stg-shared. I will start processing these with the new routine in Terry's File Analyzer on Tuesday.

elopatin-uc3 commented 3 years ago

All zip batches post-processed with the File Analyzer to remove system files and conform to the directory structure and file lists we've discussed for the submission to Dryad. For example:

  ark_13030_m5765f2s/1/mrt-dataone-map.rdf
  ark_13030_m5765f2s/1/mrt-dataone-manifest.txt
  ark_13030_m5765f2s/1/cadwsap-s3610008-005.xml
  ark_13030_m5765f2s/1/cadwsap-s3610008-005-main.csv
  ark_13030_m5765f2s/1/mrt-erc.txt
  ark_13030_m5765f2s/1/cadwsap-s3610008-005-vuln.csv
  ark_13030_m5765f2s/1/cadwsap-s3610008-005.pdf
  ark_13030_m5765f2s/3/mrt-dataone-map.rdf
  ark_13030_m5765f2s/3/mrt-dataone-manifest.txt
  ark_13030_m5765f2s/3/cadwsap-s3610008-005.xml
  ark_13030_m5765f2s/3/cadwsap-s3610008-005-main.csv
  ark_13030_m5765f2s/3/mrt-erc.txt
  ark_13030_m5765f2s/3/cadwsap-s3610008-005-vuln.csv
  ark_13030_m5765f2s/3/cadwsap-s3610008-005.pdf
  ark_13030_m5765f2s/2/mrt-dataone-map.rdf
  ark_13030_m5765f2s/2/mrt-dataone-manifest.txt
  ark_13030_m5765f2s/2/cadwsap-s3610008-005.xml
  ark_13030_m5765f2s/2/cadwsap-s3610008-005-main.csv
  ark_13030_m5765f2s/2/mrt-erc.txt
  ark_13030_m5765f2s/2/cadwsap-s3610008-005-vuln.csv
  ark_13030_m5765f2s/2/cadwsap-s3610008-005.pdf
  ark_13030_m5765f2s/manifest.xml
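
For illustration, a rough Python equivalent of the system-file cleanup (the actual processing used the File Analyzer routines linked above); the set of files treated as cruft is an assumption:

    import zipfile
    from pathlib import Path

    SYSTEM_FILES = {".DS_Store", "Thumbs.db", "desktop.ini"}  # assumed cruft list

    def clean_zip(path):
        path = Path(path)
        tmp = path.with_name(path.name + ".tmp")
        with zipfile.ZipFile(path) as src, \
             zipfile.ZipFile(tmp, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                name = Path(info.filename).name
                if name in SYSTEM_FILES or "__MACOSX" in info.filename:
                    continue  # drop system files, keep everything else
                dst.writestr(info, src.read(info))
        tmp.replace(path)  # swap the cleaned zip into place

    for z in Path("ice_arks").glob("*.zip"):
        clean_zip(z)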

Next up will be prepping the .csv with object-level metadata per ARK.

elopatin-uc3 commented 3 years ago

CSV created with ARK, Title, Creator, and Filename columns. Sorted by Creator, as this includes the District information per dataset.
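
A sketch of how such a CSV could be assembled, assuming Title and Creator can be read from each object's mrt-erc.txt ("what:" / "who:" lines); the actual prep used a File Analyzer routine, and the directory layout here is illustrative:

    import csv
    from pathlib import Path

    rows = []
    for ark_dir in sorted(Path("ice_arks_extracted").iterdir()):  # illustrative
        erc = {}
        erc_file = next(ark_dir.rglob("mrt-erc.txt"), None)
        if erc_file:
            for line in erc_file.read_text().splitlines():
                key, sep, value = line.partition(":")
                if sep:
                    erc[key.strip()] = value.strip()
        rows.append({
            "ARK": ark_dir.name,
            "Title": erc.get("what", ""),
            "Creator": erc.get("who", ""),   # carries the District information
            "Filename": ark_dir.name + ".zip",
        })

    rows.sort(key=lambda r: r["Creator"])    # sorted by Creator, as noted above
    with open("ice_metadata.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["ARK", "Title", "Creator", "Filename"])
        writer.writeheader()
        writer.writerows(rows)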

elopatin-uc3 commented 3 years ago

README file created: https://docs.google.com/document/d/114b1ipt677jozzCWHM7yxAwCgizdUjuMT2GyFcPynf8/edit?usp=sharing

This will be ported to a .txt file for inclusion with the submission.

elopatin-uc3 commented 3 years ago

Talked to Scott and he confirmed it's possible to change ownership after Dryad submission, as well as add authors if needed. He noted that creating a new owner account would be necessary, and in this case, we should consider if future dataset updates are a possibility (as the individual making the update would need knowledge of the ownership account).

elopatin-uc3 commented 3 years ago

@marisastrong I've filed a Dryad ticket for the dataset ownership change: https://github.com/CDL-Dryad/dryad-product-roadmap/issues/1158

marisastrong commented 3 years ago

Recap of call with DaveV: Merritt MN content will need to be archived on the DataONE CN side so the content is no longer discoverable. Merritt MN content will be submitted to Dryad as a single tarball along with a metadata file listing all the ARKs in the tarball. This metadata will be translated into schema.org format, which can then be harvested by DataONE.
DaveV is following up with DataONE devs to see how they can make the old content discoverable as the new content; mapping multiple ARKs to a single tarball file is something they haven't supported before.
We do not need to update EZID to have the ARKs redirect anywhere.

elopatin-uc3 commented 3 years ago

Test submission on Dryad stage: https://dryad-stg.cdlib.org/stash/dataset/doi:10.7959/dryad.3xsj3tx9r

elopatin-uc3 commented 3 years ago

@mreyescdl the ICE dataset zips are staged here: /apps/ingest-stg-shared/dataone/ice_arks/

elopatin-uc3 commented 3 years ago

@marisastrong The submission to Dryad production is complete. https://doi.org/10.18737/D7H30S Daniella has notified Dryad curators and Scott will need to adjust the ownership tomorrow.

elopatin-uc3 commented 3 years ago

Unfortunately the above DOI was associated with UCOP rather than Davis. Scott and I are going to start over, delete this one and resubmit from scratch.

elopatin-uc3 commented 3 years ago

New submission complete: https://datadryad.org/stash/dataset/doi:10.25338/B8CH02

elopatin-uc3 commented 3 years ago

Dataset authors updated to LW and PC. I'm still owner of it in case we need to make any updates. Scott will check in with Daniella to confirm this is moved through curation.

datadavev commented 3 years ago

wrt the Merritt MN deprecation, a goal of DataONE is to ensure ongoing access to content once it has been registered. In this case, the Merritt content already registered in DataONE will be retained as replicas on other nodes participating in DataONE, and the content will still be accessible through the existing identifiers.

Shutting down the Merritt node can be straightforward, and basically involves:

  1. Change authoritative MN to another node in DataONE where replicas are located
  2. Optionally archive the content so that it no longer appears in searches (though it is still downloadable)

Both of these steps can be performed within DataONE.
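
A minimal sketch of step 2, assuming the v2 CN REST API (CNCore.archive maps to PUT /archive/{pid}) and an authorized client certificate; step 1 would go through CNCore.updateSystemMetadata in the same fashion. PIDs and paths are placeholders:

    from urllib.parse import quote
    import requests

    CN_BASE = "https://cn.dataone.org/cn/v2"
    CERT = ("client-cert.pem", "client-key.pem")  # placeholder credentials

    def archive(pid):
        # Hide the object from searches while keeping it downloadable.
        resp = requests.put(f"{CN_BASE}/archive/{quote(pid, safe='')}", cert=CERT)
        resp.raise_for_status()

    # Placeholder list of the Merritt MN's PIDs, one per line.
    for pid in open("merritt_pids.txt").read().split():
        archive(pid)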

But ...

It can be beneficial to users to indicate that the content is replaced by an aggregation of what were previously individual records. This can be achieved through prov:wasDerivedFrom and its inverse, prov:hadDerivation. The aggregated content is to be made available through Dryad, and Dryad is soon to be harvested by DataONE through the schema.org metadata published by Dryad. DataONE maps the schema.org/isBasedOn property to prov:wasDerivedFrom.

Hence, adding the property schema.org/isBasedOn with the value being a list of identifiers of the contained content would enable a user who finds the resource on Dryad (e.g. via DataONE search) to determine that the Dataset is composed of the other content originally available from Merritt. A potential concern is that the number of identifiers appearing in the value of the schema.org/isBasedOn property may be quite large.
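
For concreteness, an illustrative construction of such schema.org markup; all names and identifiers here are examples, and the real isBasedOn list would carry one entry per contained object:

    import json

    dataset = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "@id": "https://doi.org/10.xxxx/example",  # the aggregate's DOI
        "name": "Aggregated Merritt MN content (example)",
        "isBasedOn": [
            "ark:/13030/m5765f2s",                 # example Merritt PIDs
            "ark:/13030/m0000000",
            # ...one entry per object in the aggregate, so potentially
            # thousands of identifiers
        ],
    }
    print(json.dumps(dataset, indent=2))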

Adding the inverse relationship to the existing content (i.e. the resource maps already published to DataONE) would be cumbersome, since it would require generating new identifiers for all the existing resource maps; however, the size of each entry would be quite small (a single identifier referring to the aggregated dataset).

Completing the deprecation of the Merritt MN while retaining existing content and cross-referencing the existing content with the new aggregate involves:

marisastrong commented 3 years ago

Thank you @datadavev for this write-up of next steps and alternatives. A couple questions regarding the less straightforward option:

  1. For retaining existing content and cross-referencing with the new aggregate, regarding the step to "Investigate updating all resource maps for Merritt": is this updating of the resource maps performed by the Merritt MN? And do these resource maps need to be included with the object stored in the Dryad MN?

  2. It is not clear what "the Merritt content already registered in DataONE will be retained as replicas on other nodes participating in DataONE, and the content will still be accessible through the existing identifiers" means in practice. What are the existing identifiers pointing to: the Merritt MN or the DataONE CN?

I'm trying to confirm whether all these steps can be performed if the Merritt MN is no longer running on our end.

datadavev commented 3 years ago

Item 1: The resource maps are used to identify aggregates of information, such as the data files and metadata of a data set. The usual method for updating the resource maps is through the MN, but this is not strictly necessary. If this update is not performed on the MN, what will be needed is a list of identifiers for the old objects that have been bundled and placed on Dryad, together with the identifier for the newly created bundle. This will enable the mapping from the old to the new to be determined.
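
As a sketch of that list, reusing the hypothetical ark_mapping.csv from the packaging sketch earlier in this thread: one row per bundled object, each pointing at the single bundle identifier.

    import csv

    BUNDLE_PID = "<identifier of the Dryad bundle>"  # placeholder

    with open("ark_mapping.csv") as src, \
         open("old_to_new.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["old_pid", "new_pid"])
        seen = set()
        for row in csv.DictReader(src):
            if row["ark"] not in seen:  # one row per object, not per file
                seen.add(row["ark"])
                writer.writerow([row["ark"], BUNDLE_PID])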

For item 2: Search results in the DataONE search UI use the CNs to resolve the location of identifiers. Similarly, any other systems that specifically use the DataONE CNs to resolve identifiers will be pointed to content within the DataONE environment. Resolving the same identifier with another system such as N2T or EZID will resolve to wherever that metadata is directing the client.