DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

Add prov to an execution retroactively #168

Closed gothub closed 7 years ago

gothub commented 7 years ago

As described in https://github.com/DataONEorg/sem-prov-design/issues/228, a function will be added that will insert provenance relationships into a DataPackage for a script and the files that it has read and written. The proposed call:

addRunProv(x, programFile, inputFiles, outputFiles, EMLfile)

with parameters:

The function will insert the provenance relationships that the DataONE RDF/XML indexing subprocessor requires in order to index the prov relationships properly.
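To make the idea concrete, here is a rough sketch in base R of the kind of triples such a function might generate for one run. It is not the implementation: the helper name `buildRunProvTriples`, the placeholder execution and association identifiers, and the exact set of relationships (types, `prov:used`, `prov:wasGeneratedBy`, `prov:wasDerivedFrom`, and the qualified association linking the execution to its program) are assumptions about what the indexer expects. The namespace URIs are written from memory and should be checked against the PROV and ProvONE ontologies.

```r
# Namespace URIs (assumed; verify against the PROV and ProvONE ontologies).
PROV     <- "http://www.w3.org/ns/prov#"
PROVONE  <- "http://purl.dataone.org/provone/2015/01/15/ontology#"
RDF_TYPE <- "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical helper: build subject/predicate/object rows for one execution.
# The default executionId and associationId are placeholders for illustration.
buildRunProvTriples <- function(programId, inputIds, outputIds,
                                executionId = "urn:uuid:execution-1",
                                associationId = "_:association-1") {
  empty <- data.frame(subject = character(0), predicate = character(0),
                      object = character(0), stringsAsFactors = FALSE)
  # Build rows; length-1 arguments are recycled against the longer vector.
  triple <- function(s, p, o) {
    if (length(s) == 0 || length(o) == 0) return(empty)
    data.frame(subject = s, predicate = p, object = o,
               stringsAsFactors = FALSE)
  }
  # Every output is treated as derived from every input (an assumption).
  pairs <- expand.grid(output = outputIds, input = inputIds,
                       stringsAsFactors = FALSE)
  do.call(rbind, list(
    triple(executionId, RDF_TYPE, paste0(PROVONE, "Execution")),
    triple(programId, RDF_TYPE, paste0(PROVONE, "Program")),
    triple(associationId, RDF_TYPE, paste0(PROV, "Association")),
    # The execution is linked to its plan via a qualified association.
    triple(executionId, paste0(PROV, "qualifiedAssociation"), associationId),
    triple(associationId, paste0(PROV, "hadPlan"), programId),
    triple(inputIds, RDF_TYPE, paste0(PROVONE, "Data")),
    triple(outputIds, RDF_TYPE, paste0(PROVONE, "Data")),
    triple(executionId, paste0(PROV, "used"), inputIds),
    triple(outputIds, paste0(PROV, "wasGeneratedBy"), executionId),
    triple(pairs$output, paste0(PROV, "wasDerivedFrom"), pairs$input)
  ))
}

# Example: one script, two inputs, one output (identifiers are made up).
triples <- buildRunProvTriples("urn:uuid:prog",
                               c("urn:uuid:in1", "urn:uuid:in2"),
                               "urn:uuid:out1")
```

Each row of the result could then be passed to something like the existing datapack::insertRelationship() method, which is the sense in which the new function would be a higher-level wrapper.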

Where should this function be placed in the DataONE package? It doesn't really fit in D1Client.R, so do we need a new S4 class?

It doesn't quite make sense to place this in the R datapack package either - we are modifying a DataPackage here, but these are DataONE provenance relationships that are being added, which the R DataONE package should have knowledge of, not datapack.

sycao5 commented 7 years ago

So this new function will be in the R DataONE package, right?

gothub commented 7 years ago

Yes, that is the plan, unless someone has reasons why it should be in another package.

mbjones commented 7 years ago

I think this would be best in datapack::DataPackage, as that is the container for the RDF and all of the components. It especially makes sense because x is of class DataPackage: the provenance info would be added to the package just like the existing statements are, similar to the existing datapack::insertRelationship() method. datapack is already RDF and provenance aware. This new method would be a higher-level version of insertRelationship() that inserts multiple relationships at once.

Are inputFiles and outputFiles intended to be vectors of files, vectors of identifiers, or both? We definitely need to be able to do it via vectors of identifiers. If so, maybe the parameters should be renamed to inputIdentifiers and outputIdentifiers? If they are files, does the function add the files to the package as well? Needs discussion.

gothub commented 7 years ago

Yes, datapack stores the prov relationships, but I don't think it should contain the knowledge of the ProvONE data model, e.g. that an execution is linked to a plan via a qualified association. If the data model changes, then dataone would depend on datapack being updated to match.

Regarding inputFiles and outputFiles: this function handles the use case where a user has a collection of files that are the artifacts of an execution that has already run. The function would be used to build a DataPackage from the script, input files and output files of such a run, so it would take care of assigning pids to DataObjects.
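As a rough illustration of the pid-assignment step, the sketch below maps each file path to a freshly generated identifier in base R. The helper name `assignPids` and the random-hex generator are hypothetical stand-ins; a real implementation would presumably use a proper UUID generator and would also create the DataObjects.

```r
# Hypothetical helper: assign one new identifier to each file path.
# makeId is a stand-in generator producing a "urn:uuid:"-prefixed random
# hex string; it does NOT produce a standards-compliant UUID.
assignPids <- function(files,
                       makeId = function() {
                         hex <- sample(c(0:9, letters[1:6]), 32, replace = TRUE)
                         paste0("urn:uuid:", paste(hex, collapse = ""))
                       }) {
  ids <- vapply(files, function(f) makeId(), character(1))
  # Return a character vector of pids named by the file they belong to.
  setNames(unname(ids), files)
}

# Example with made-up file names (the files do not need to exist here).
pids <- assignPids(c("analysis.R", "input.csv", "output.csv"))
```

The resulting pids, rather than the file paths, are what would appear as subjects and objects in the provenance relationships.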

Maybe there are other use cases that we need to consider.

gothub commented 7 years ago

This functionality was added to datapack in commit 9afb17209134d5b5e4a3d7061daed333835f86ac.