Open jeanetteclark opened 5 years ago
Thanks for chatting about this earlier today and writing this up.
Stepping back, I see two general patterns: (1) Keeping PROV by default and (2) dropping PROV by default unless certain scenarios are met. You've outlined option (2) which I think is a reasonable approach. If we go that route, I think we want two things:
arcticdatautils
to not drop the PROV when it otherwise would. Something like a dont_drop_prov_plz = TRUE
argument to publish_update()
.warning()
/stop()
/etc options.I think, for (2), we could go with a warning()
by default and print out a super useful message telling the user exactly how to resurrect any PROV such as:
Hey, you lost all of your PROV statements because the PROV that existed
in the previous version of this Data Package referenced Objects that
aren't present in the updated version. You can override this in the
future by adding `dont_drop_prov_plz = TRUE` and you can also recover
the lost PROV by running:
prov_statements <- get_prov({paste in the old resource map PID here for the user})
update_package(mn, {paste in new rm pid}, other_statements = prov_statements)
Does something like that seem like a good solution?
Oh, and as a general note, the way I've been thinking about this isn't retaining PROV statements but actually being able to split the statements in the ORE into two groups: (1) Data Packaging statements and (2) all other statements. So in this ticket, any time I/we say PROV statements we'd mean "other statements". I think this is a better approach for a few reasons:
When updating a document like an RDF resource map, I think the default should be to preserve its contents and not be lossy unless there is a specific reason to drop something. Losing RDF triples should be considered a bug, as we will likely add other triples to our ORE docs over time. We shouldn't lose metadata every time it travels through the utils package, as that introduces the need for lots of manual fixes and people might forget to add it all back in. I think the default behavior for our R (and other) packages should be:
1) Start from the existing resource map with all triples 2) add aggregation related triples for new or updated content, or remove them for deleted content 3) add or remove PROV triples as needed for updated or deleted objects (this should preserve all triples for objects that are unchanged)
This does require the software to have a model of a package that understands its components. The datapack::DataPackage
class I think has a lot of this built in.
@mbjones, I'm a little confused about your 3rd point since it seems in conflict with your comments here, but maybe I am misunderstanding it.
Let's say we have a package with the following PROV trace:
OBJECT_2 <---derivedFrom--- OBJECT_1 (using a SCRIPT during an EXECUTION)
If OBJECT_2 is updated with a new version of the file from a different execution of the same script, I think it is clear that you cannot blindly add PROV triples associated with the new object (this was the conclusion we came to over a year ago).
If we instead only remove the PROV triples associated with OBJECT_2 (because OBJECT_2 is not included in the data package anymore) and don't make any assumptions about how the new version of the object fits in, the triples would look like this, with the last three rows dropped.
subject | predicate | object | drop |
---|---|---|---|
EXECUTION | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Execution | FALSE |
EXECUTION | http://www.w3.org/ns/prov#qualifiedAssociation | _r1561048296r6751r1 | FALSE |
SCRIPT | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Program | FALSE |
_r1561048296r6751r1 | http://www.w3.org/ns/prov#hadPlan | SCRIPT | FALSE |
EXECUTION | http://www.w3.org/ns/prov#used | OBJECT_1 | FALSE |
OBJECT_1 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Data | FALSE |
OBJECT_2 | http://www.w3.org/ns/prov#wasGeneratedBy | EXECUTION | TRUE |
OBJECT_2 | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://purl.dataone.org/provone/2015/01/15/ontology#Data | TRUE |
OBJECT_2 | http://www.w3.org/ns/prov#wasDerivedFrom | OBJECT_1 | TRUE |
I'm not sure I know enough about the PROV model to say whether the result of dropping these triples leaves us with a valid trace or not. We should also consider whether this approach would work if we dropped all of the triples associated with a script that got updated. From a user perspective it would make it much easier to update their provenance because they would only have to add it back in for files that they updated, as opposed to all of the files.
We shouldn't lose metadata every time it travels through the utils package, as that introduces the need for lots of manual fixes and people might forget to add it all back in.
I agree, and that's the way we had it when we started inserting PROV into DataONE packages, but there's plenty of cases where migrating PROV forward doesn't make sense (which we've outlined here and elsewhere). @jeanetteclark 's example above is a good example of this.
Re:
- add aggregation related triples for new or updated content, or remove them for deleted content
- add or remove PROV triples as needed for updated or deleted objects (this should preserve all triples for objects that are unchanged)
This sound good but it doesn't sound like a good fit for the arcticdatautils
API which is currently totally ignorant of stuff like this. I feel like, if we really want to have smarter resource map processing, we might wanna lean on dataone
and datapack
which implements a read-modify-write pattern whereas arcticdatautils
implements a read-modify
pattern and isn't aware of your intended changes.
Yeah it is not set up very well to do this kind of thing at all at the moment. This is definitely something we should consider when we start work on refactoring the R packages.
In the meantime...we should find some kind of solution for arcticdatautils
. @amoeba I liked your solution above, it should be fairly easy to implement in the package
@amoeba I finally got around to this and have a tested solution. Do you mind checking it out here:
https://github.com/jeanetteclark/arcticdatautils/tree/carry_prov
@dmullen17 it would be good if you had a look too. If we like this, I can create a pull request for it for more formal review
I have tests written up but to play with the functionality install the package from that branch and then play around with:
mn <- getMNode(CNode("STAGING"), "urn:node:mnTestARCTIC")
package <- create_dummy_package(mn, size = 3)
# add dummy prov to package
package_prov <- suppressMessages(add_dummy_prov(mn, package$resource_map))
# publish a new data object
data_new <- create_dummy_object(mn)
# Publish an update on it and observe output based on what data objects you include, whether to keep the prov, etc
update <- publish_update(mn,
package$metadata,
package_prov,
data_new,
keep_prov = FALSE,
check_first = FALSE)
Hey @jeanetteclark, thanks for putting this together. The warning with example code is 💯 btw.
Is defaulting to removing provenance (keep_prov = FALSE
) what we want here? For the common cases, using publish_update
to give a package a DOI or otherwise updating the metadata, provenance should probably get forwarded to the new package's resource map since it's still accurate. I'd have to scratch my head a bit more to think out how to integrate this in your flow here but it seems doable.
PS: I had a few other comments that'd be suitable during code review I could make when you file a PR.
Certainly something that could be up for debate! I think what you describe is already integrated into my workflow, just need to change the default arg. I'll create a PR and we can see where we get from there
(carryover from this issue: https://github.com/NCEAS/metacatui/issues/310)
Scenario A: update package with new object not part of existing PROV trace O1 <---derivedFrom--- O3 (using script S1 during execution E2) O4 added as new object Metadata updated
Scenario B: update metadata for a package without changing objects O1 <---derivedFrom--- O3 (using script S1 during execution E2) Metadata updated
In both situations, all prov relationships between O1 and O3 should be included in the new version of the package.
There is some code in
arcticdatautils
already that will carry forward PROV - see this commit but it needs to be expanded a bit to allow for PROV to be carried over in the scenarios above, but not in other scenarios where pids involved in the PROV trace are updated with new versions.So, before adding the carried over PROV statements from the old resource map to the new resource map, I think we need to check that the pids contained within those statements all exist in the
data_pids
argument forupdate_resource_map
If not all of the pids involved in the PROV trace exist in the
data_pids
arguement, the function will drop all of the PROV statements in the updated version of the resource map. In this case should the function: