ESIPFed / Earth-Data-Provenance-Workshop

https://esipfed.github.io/Earth-Data-Provenance-Workshop/
Apache License 2.0

Work PROV into processing scripts for the USGS Spatial Feature Registry #6

Open skybristol opened 6 years ago

skybristol commented 6 years ago

Work with one or more methods of generating PROV from the workflows to register and process spatial features in https://github.com/usgs-bis/sfr.

skybristol commented 6 years ago

After starting to fiddle with this, the first use case I'm exploring is a workflow to process the US National Vegetation Classification into a new, document-based database format. It's something I have to get worked up anyway. I've started using the PROV Python library to poke at the process of assigning namespaces in context, basing some of it on the nicely compiled PAV ontology, which seems reasonable for base PROV concepts (unless someone more knowledgeable wants to point me in a different direction).

I've also been working on the best way to identify the file entities being processed (in this case, out of our ScienceBase repository) and breaking up my clunky code to better identify the handful of higher-level agents that we would want to record in a PROV trace. I'll build and store PROV (PROV-N or JSON-LD) locally at this point, with an eye toward the work from @narock and @fils on Provisium and some process to either register an API for harvesting/indexing or an input API to send something in.
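For anyone following along, a minimal sketch of the kind of trace described above can be assembled as a W3C PROV-JSON document using only the standard library (no `prov` package dependency). Everything here is illustrative: the ScienceBase item ID, the `ex:` activity and agent identifiers, and the labels are hypothetical placeholders, not the actual SFR workflow's names.

```python
import json

# Namespace prefixes for the trace. "sb" points at the ScienceBase catalog;
# "ex" is a made-up example namespace standing in for a real project namespace.
PREFIXES = {
    "sb": "https://www.sciencebase.gov/catalog/item/",
    "pav": "http://purl.org/pav/",
    "ex": "http://example.org/sfr/",
}

def build_prov_json():
    """Assemble a minimal PROV-JSON document for one processing step."""
    return {
        "prefix": PREFIXES,
        # File entity pulled from ScienceBase (ID is hypothetical) and the
        # document database it is transformed into.
        "entity": {
            "sb:1234abcd": {"prov:label": "USNVC source file"},
            "ex:usnvc-db": {"prov:label": "USNVC document database"},
        },
        # The processing run that did the transformation.
        "activity": {
            "ex:usnvc-processing": {},
        },
        # One of the higher-level agents to record against the run.
        "agent": {
            "ex:usgs-bis": {"prov:type": "prov:Organization"},
        },
        # Relations tying entities, activity, and agent together.
        "used": {
            "_:u1": {"prov:activity": "ex:usnvc-processing",
                     "prov:entity": "sb:1234abcd"},
        },
        "wasGeneratedBy": {
            "_:g1": {"prov:entity": "ex:usnvc-db",
                     "prov:activity": "ex:usnvc-processing"},
        },
        "wasAssociatedWith": {
            "_:a1": {"prov:activity": "ex:usnvc-processing",
                     "prov:agent": "ex:usgs-bis"},
        },
    }

if __name__ == "__main__":
    print(json.dumps(build_prov_json(), indent=2))
```

The same structure could be written out to disk per run and later converted to PROV-N or RDF, which is roughly the "build and store locally" step before anything gets registered with a harvesting or input API.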

I'll make the code for this public as soon as I clean up some repo messiness and make the underlying data public.

narock commented 6 years ago

@skybristol have you seen this link: https://docs.google.com/spreadsheets/d/1gsyuKY-xD1yQaHD6QVB6vJ3YmDWL9WKPiSopW8BtuB8/edit#gid=0

It's the PROV-ES mapping of PROV concepts to NASA dataset terminology. I'm not familiar with the dataset you're working on, and I'm not sure how it relates to NASA versioning terminology, but you may find the terms helpful for identifying file entities.

dgarijo commented 6 years ago

@skybristol, have you considered trying prov-o-matic? It's a method for capturing provenance from Python notebooks: https://github.com/Data2Semantics/prov-o-matic I have been wanting to try it for a while because, in theory, it lets you avoid having to create your own trace as you do with the PROV Python library.

Also, if you generate JSON-LD, it would be easier for me to have a look. If you'd prefer to share the ongoing work, I can help create the trace as well.