mbjones opened this issue 6 years ago (status: Open)
Original Redmine Comment Author Name: Chris Jones (Chris Jones) Original Date: 2011-10-28T21:23:22Z
Creation of SystemMetadata when no SystemMetadata is provided (i.e., any object creation not through the DataONE API) is currently handled in two separate classes: MetacatHandler (on insert and update events) and MetacatPopulator (Ben updated this to work manually for the 0.6.4 API; Robert recently migrated it to the 1.0.0 API). A class similar to MetacatPopulator should be created that consolidates this functionality, and those two classes should then call into the common code.
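The shared class would mainly centralize the fields that must be computed identically in both call paths (checksum, size, upload date). A minimal sketch, assuming a hypothetical `SystemMetadataGenerator` with an illustrative result type rather than the actual DataONE `SystemMetadata` class:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Date;

// Hypothetical sketch of the shared generator that both MetacatHandler and
// MetacatPopulator could delegate to. The class and field names here are
// illustrative assumptions, not the real DataONE types.
public class SystemMetadataGenerator {

    public static class GeneratedSysMeta {
        public String identifier;
        public String checksumAlgorithm = "MD5";
        public String checksum;
        public long size;
        public Date dateUploaded;
    }

    // Derive the fields that must come out identical whether we are reacting
    // to an insert/update event or bulk-populating existing documents.
    public static GeneratedSysMeta generate(String pid, InputStream object)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        long size = 0;
        int n;
        while ((n = object.read(buf)) != -1) {
            md.update(buf, 0, n);
            size += n;
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        GeneratedSysMeta sm = new GeneratedSysMeta();
        sm.identifier = pid;
        sm.checksum = hex.toString();
        sm.size = size;
        sm.dateUploaded = new Date();
        return sm;
    }

    public static void main(String[] args) throws Exception {
        GeneratedSysMeta sm = generate("autogen.1",
                new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8)));
        // prints: 5d41402abc4b2a76b9719d911017c592 5
        System.out.println(sm.checksum + " " + sm.size);
    }
}
```

With this in place, both callers would only differ in how they obtain the object stream and the pid.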
Original Redmine Comment Author Name: ben leinfelder (ben leinfelder) Original Date: 2011-11-30T21:06:12Z
Using the Foresite library in Metacat to build the ORE maps fails because of jar dependencies. The dependency chain is: d1_libclient -> Foresite -> Jena -> Xerces. ORE generation fails because Jena expects Xerces v2.7 (and works with v2.6), but Metacat was recently upgraded to use v2.11.
Please see bug 5291 about Xerces 2.11 and sensorML validation. It would be nice to revert to Xerces 2.7 so that the libraries would be compatible, but that might invalidate sensorML documents that are already in Metacat deployments (Xerces 2.11 shipped with Metacat 1.9.5).
Original Redmine Comment Author Name: ben leinfelder (ben leinfelder) Original Date: 2011-12-01T20:29:01Z
Disregard the Xerces panic -- I had an old XercesImpl.jar hanging out in my classpath.
Original Redmine Comment Author Name: ben leinfelder (ben leinfelder) Original Date: 2011-12-08T23:52:44Z
Items I am suspicious about: 2(c)(i) -- Generating new objects from external data (URLs) that metadata points to. There's usually a reason they are not in Metacat, right? Some of them might be very large?
2(b) (the second set) -- If we update only a data object via the Metacat API, nothing else should happen. If it is part of an EML package, the EML file will also be updated (to use the new data object's revision number). So nothing should be triggered by the data update in terms of ORE regeneration.
2(?) -- There's currently only a loose association between ORE documents in Metacat and the documents they describe (which are assumed to be in Metacat but are pointed to with DataONE endpoints). So if we do update an EML package, we'll have to [somehow] search our local Metacat for any ORE package that uses the EML package as its basis, mark it as obsoletedBy the new ORE package we generate for the EML file, and add that new ORE file to Metacat. It's the step where we find the ORE files that use our EML file that scares me -- is this just a search against the ORE RDF/XML file? That's as formal as we can get with the current infrastructure for ORE maps in Metacat.
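The "search against the ORE RDF/XML" fallback could be as blunt as a substring scan of each stored map for the EML pid's resolve URI. A minimal sketch under assumptions: `oreDocs` maps ORE pid to its RDF/XML text, and the resolve-URI prefix is illustrative (the actual CN endpoint may differ):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the crude "search the ORE RDF/XML" approach discussed above:
// treat each stored ORE serialization as plain text and look for the EML
// pid's resolve URI. Both the input map and the resolve prefix are
// assumptions for illustration; a real implementation would parse the RDF.
public class OreReferenceFinder {

    static final String RESOLVE_PREFIX = "https://cn.dataone.org/cn/v1/resolve/";

    public static List<String> findOresDescribing(String emlPid,
            Map<String, String> oreDocs) {
        List<String> hits = new ArrayList<>();
        // Caveat: a plain contains() will also match pids that merely share
        // a prefix (e.g. "eml.1" inside "eml.1.1"); RDF parsing avoids that.
        String needle = RESOLVE_PREFIX + emlPid;
        for (Map.Entry<String, String> e : oreDocs.entrySet()) {
            if (e.getValue().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```

Parsing each map with Jena and querying for `ore:aggregates` triples would be the more formal version of the same lookup, at the cost of the dependency issues noted earlier in this thread.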
Original Redmine Comment Author Name: ben leinfelder (ben leinfelder) Original Date: 2011-12-16T18:03:00Z
From discussion yesterday:
Original Redmine Comment Author Name: ben leinfelder (ben leinfelder) Original Date: 2012-01-05T22:15:12Z
I'm now downloading remote data that is referenced by EML documents, saving it on the MN with an "autogen" ID, and including that in the ORE map.
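The "autogen" scheme could be as simple as a fixed prefix plus a UUID. This is a sketch of one way to do it, not necessarily the scheme Ben implemented; the UUID suffix guarantees uniqueness and keeps the identifier far under DataONE's 800-character limit:

```java
import java.util.UUID;

// Hypothetical sketch of an "autogen" identifier scheme for downloaded
// remote data. The "autogen." prefix matches the comment above; using a
// random UUID as the suffix is an assumption chosen for uniqueness.
public class AutogenIdentifier {
    public static String next() {
        return "autogen." + UUID.randomUUID().toString();
    }
}
```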
Original Redmine Comment Author Name: ben leinfelder (ben leinfelder) Original Date: 2012-01-14T00:55:50Z
Additional notes:
Original Redmine Comment Author Name: ben leinfelder (ben leinfelder) Original Date: 2013-01-24T19:43:07Z
Are we committed to doing this? LTER was going to be a major source for new data, but perhaps plans have changed. Revisit for 2.1 release.
Original Redmine Comment Author Name: Redmine Admin (Redmine Admin) Original Date: 2013-03-27T21:30:46Z
Original Bugzilla ID was 5522
Author Name: Matt Jones (Matt Jones) Original Redmine Issue: 5522, https://projects.ecoinformatics.org/ecoinfo/issues/5522 Original Date: 2011-10-28 Original Assignee: ben leinfelder
The KNB data sets, and EML data in general, represent linkages to data as online/url linkages in EML documents. When we convert the KNB to a DataONE Member Node, we need a mechanism to convert these EML packages into DataONE ORE-based data packages. Depending on the specific situation, different steps will need to be taken:
1) For packages that arrive via the DataONE services, do nothing
2) For packages that arrive via the Metacat and EcoGrid services, check all online/url links:
   a) if it is an ecogrid:// link, then create the corresponding link in an ORE document
   b) if it is a URL marked as "information" in EML, ignore it
   c) if it is a URL marked as "download" in EML, then:
      i) attempt to download the data, and if successful:
         -- check if it is real data (hard to do, but filtering out obvious HTML errors, login pages, HTML pages, etc. would be tractable)
         -- insert it into the MN using the permissions and policies specified in the EML document (need to determine what the ID would be for this object -- maybe the original URL, but need to ensure uniqueness and < 800 chars, etc.)
         -- add a link to the ORE document for this dataset
   d) insert the final ORE document that's been assembled (need to determine the identifier to use)
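The "check if it is real data" step in 2(c)(i) can only be heuristic, as noted. One tractable version is rejecting responses that are obviously HTML (error pages, login forms). A minimal sketch; the specific markers checked are illustrative guesses, not an exhaustive filter:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the heuristic data-vs-HTML filter suggested in step 2(c)(i).
// It inspects the reported content type and the first bytes of the
// download; both checks are illustrative assumptions.
public class DownloadFilter {
    public static boolean looksLikeRealData(String contentType, byte[] head) {
        if (contentType != null
                && contentType.toLowerCase().startsWith("text/html")) {
            return false;
        }
        String start = new String(head, StandardCharsets.UTF_8)
                .trim().toLowerCase();
        // Obvious HTML documents (error pages, login forms) are not data.
        return !(start.startsWith("<!doctype html") || start.startsWith("<html"));
    }
}
```

A caller would pass the `Content-Type` header and a small prefix of the response body, and only insert the object into the MN when the check passes.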
This utility method should be callable in two ways:
1) For an existing EML document already in Metacat, likely to be run on initial conversion and periodically to be sure all proper data packages are created
   -- need to be sure that this doesn't create duplicate packages
2) On any INSERT or UPDATE calls
   -- when EML is updated, need to rebuild the package
   -- when data objects are updated, need to rebuild the package -- but need to watch out for sequential ops not interfering (e.g., when Morpho updates a data file, then updates an EML file to point at the new data file in a second step, we should only create one new ORE package version)
   -- on update calls, be sure to set appropriate obsoletes/obsoletedBy properties on the ORE package (the update() calls themselves should handle these properties for the sysmeta of EML and data objects already)
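The "only one new ORE package version" requirement in point 2 suggests coalescing: rather than regenerating the map on every call, mark the affected package dirty and rebuild once after the related operations settle. A minimal sketch under assumptions; package lookup and the actual ORE generation (including the obsoletes chain) are stubbed out:

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Sketch of a coalescing rebuild queue: Morpho's data-then-EML two-step
// marks the same package dirty twice but triggers a single rebuild. The
// rebuild itself is a placeholder; a real version would generate the new
// ORE map and set obsoletes/obsoletedBy on it.
public class OreRebuildQueue {
    private final Set<String> dirtyPackages = new LinkedHashSet<>();

    public synchronized void markDirty(String packageId) {
        dirtyPackages.add(packageId);
    }

    // Called once after a batch of related INSERT/UPDATE operations settles;
    // returns the number of packages rebuilt.
    public synchronized int flush() {
        int rebuilt = 0;
        for (String pid : dirtyPackages) {
            rebuildOre(pid);
            rebuilt++;
        }
        dirtyPackages.clear();
        return rebuilt;
    }

    protected void rebuildOre(String packageId) {
        // placeholder: generate a new ORE map and chain it to the prior one
    }
}
```

How long to wait before flushing (a timer, the end of a transaction, or an explicit client signal) is the open design question this comment raises.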