Preservation Metadata - Githubissues

subotic commented 6 years ago

We need to calculate and store fixity information for Knora resources. This is needed for the data repository side of Knora, so that we are able to check and prove that resources were not altered or corrupted.
We also need to perform these checks automatically and periodically.
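
A periodic check could look roughly like the sketch below. It assumes SHA-256 as the checksum and a hypothetical `load` callback that returns the current serialized bytes of a resource; neither is something Knora prescribes, and the names `sha256_hex` and `check_fixity` are made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Checksum of one serialized resource (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def check_fixity(stored, load):
    """Return the ids of all resources whose content no longer matches.

    stored: dict mapping resource id -> checksum recorded earlier
    load:   hypothetical callback returning the current bytes for an id
    """
    return [
        resource_id
        for resource_id, expected in sorted(stored.items())
        if sha256_hex(load(resource_id)) != expected
    ]
```

A scheduler (cron or similar) would call `check_fixity` at a fixed interval and report any ids it returns.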

lrosenth commented 6 years ago

This has to be done for each version of a resource – as we have versioning…

On 03.05.2018 at 16:21, Ivan Subotic <notifications@github.com> wrote:

> We need to calculate and store fixity information (https://www.dpconline.org/handbook/technical-solutions-and-tools/fixity-and-checksums) for Knora resources. This is needed for the data repository side of Knora, so that we are able to check and prove that resources were not changed.

benjamingeer commented 6 years ago

But we don't have versions of resources, only versions of values.

subotic commented 6 years ago

> But we don't have versions of resources, only versions of values.

Yes, we don't have explicit versions of resources, but implicitly (I think), any change to a value creates a new version of a resource.

Yesterday, I had a long conversation with @lrosenth. Here is a summary in very broad strokes. It is only a first draft, and we still need to discuss whether it is feasible:

Checksum: to calculate the checksum of a set of triples, calculate the checksum of each triple and then merge the results with a combining function. This is what I used in my PhD, based on this paper: https://pdfs.semanticscholar.org/e497/56a0bf7bcf6ce4b033c4f5261b283d0be394.pdf
On every value change, we calculate the checksum of the resource. The new checksum is the checksum of the previous version combined with the checksums of the triples of the new value.
At the same time, a new ARK ID is generated (resource ID + timestamp).
The resource IRI, ARK ID, and checksum are stored somewhere else (maybe in a separate graph).
This fixity information is also replicated to a separate server, where it is available read-only. We need to make sure that fixity information is not stored only together with the data, where both could be manipulated at the same time.
The goal is to have a gapless log of every change to a resource, backed by the checksum. We need to provide evidence that a resource didn't change inadvertently over time.
I'm not sure how this will work (if at all) if we make changes to the data model and need to change the data.
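
To make the idea concrete, here is a minimal sketch of the per-triple checksum with an incremental update. Everything specific in it is an assumption: SHA-256 as the per-triple hash, addition modulo 2**256 as the combining function (one common order-independent choice, not necessarily the one from the paper), and a `version_id` that only stands in for the ARK ID as "resource ID + timestamp" and is not a real ARK format.

```python
import hashlib
from datetime import datetime, timezone

MODULUS = 2 ** 256  # per-triple digests are treated as 256-bit integers


def triple_checksum(triple):
    """SHA-256 of one (subject, predicate, object) triple, as an integer."""
    subject, predicate, obj = triple
    # The unit separator keeps ("ab", "c", ...) distinct from ("a", "bc", ...).
    data = "\x1f".join((subject, predicate, obj)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def combine(checksums, start=0):
    """Assumed combining function: sum modulo 2**256 (order-independent)."""
    total = start
    for checksum in checksums:
        total = (total + checksum) % MODULUS
    return total


def fixity_record(resource_iri, previous_checksum, new_triples, timestamp):
    """Build the record for a new version from the previous checksum
    plus the triples added by the new value."""
    checksum = combine(
        (triple_checksum(t) for t in new_triples), start=previous_checksum
    )
    return {
        "resource": resource_iri,
        # Placeholder for the ARK ID: resource ID + timestamp.
        "version_id": f"{resource_iri}@{timestamp.strftime('%Y%m%dT%H%M%SZ')}",
        "checksum": format(checksum, "064x"),
    }
```

Because addition is commutative, the checksum does not depend on the order in which the triple store returns the triples, and a new version only needs the previous checksum and the new triples rather than the whole resource.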

subotic commented 6 years ago

> I'm not sure how this will work (if at all) if we make changes to the data model and need to change the data.

OK, now I'm definitely sure that this will not work. Any change to the data model that requires changes to the data will render all checksums invalid.

@lrosenth Do we need to make our lives so hard by trying to build a system that is at the same time a VRE (virtual research environment) and a long-term data archival repository? Can't we separate the two? We could have an additional read-only layer that stores the data and the checksums on every change and allows us to recreate the repository for any point in time: basically a "backup on steroids" solution. That way we could do whatever is needed for running the VRE in the upper VRE layer, while still preserving every change in the lower repository layer.
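
As a rough illustration of the lower layer, the sketch below replays an append-only change log to rebuild the state at a given point in time. The `Change` record and the `state_at` function are hypothetical names, and a real log would store triples and checksums rather than a single string per resource.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One immutable entry in the append-only log (hypothetical shape)."""
    timestamp: datetime
    resource: str
    value: Optional[str]  # None marks a deletion


def state_at(log, when):
    """Replay the log, which is ordered by timestamp, up to and including `when`."""
    state = {}
    for change in log:
        if change.timestamp > when:
            break
        if change.value is None:
            state.pop(change.resource, None)
        else:
            state[change.resource] = change.value
    return state
```

Since entries are only ever appended, the VRE layer can be migrated or rebuilt freely while the log still records every change that was made.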

subotic commented 6 years ago

We also don't need to reinvent the wheel with regard to the data model for preservation metadata. The Library of Congress has a well-established standard called PREMIS, for which they also have an OWL ontology.

dasch-swiss / dsp-api

Preservation Metadata #843