earthcubearchitecture-project418 / CDFSemanticNetwork

A Semantic Network built from the structured data on the web offerings of the EarthCube CDF members

lastmod for create and update issues #2

Open fils opened 5 years ago

fils commented 5 years ago

Approaches to incremental updates based on provider clues.

Just some simple thoughts in the open on approaches to incremental and update events. NEW is simple of course, but UPDATE and DELETE are worth some discussion.

New items

To address the issue of records that are NEW we can simply look for items in a sitemap that we have not seen before.
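A minimal sketch of that check, assuming the harvester keeps a set of previously seen URLs (the `seen` set, the function names, and the example.org URLs are hypothetical, not part of this project):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_locs(sitemap_xml):
    """Return the set of <loc> URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }

def new_items(sitemap_xml, seen):
    """Return URLs in the sitemap that are not in the set of seen URLs."""
    return sitemap_locs(sitemap_xml) - set(seen)

# Hypothetical sitemap used only for illustration.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc></url>
  <url><loc>https://example.org/dataset/2</loc></url>
</urlset>"""

if __name__ == "__main__":
    seen = {"https://example.org/dataset/1"}
    print(new_items(EXAMPLE_SITEMAP, seen))
```

How the seen set is persisted between harvests (file, database, triple store) is left open here.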

Updated Items

To see that an item is updated, we could leverage the lastmod node in the sitemaps spec (ref https://www.sitemaps.org/protocol.html ). By comparing the lastmod value against the one recorded at the previous harvest, we could tell if a record has been updated.
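A rough sketch of the lastmod comparison, assuming each harvest yields a mapping from URL to its lastmod string (the dictionaries and function names are hypothetical). It handles only the full-date and date-time forms of lastmod, and treats values without a timezone as UTC, which is an assumption rather than something the spec guarantees:

```python
from datetime import datetime, timezone

def parse_lastmod(value):
    """Parse a lastmod string such as 2019-05-01 or 2019-05-01T12:00:00Z."""
    text = value.strip()
    if text.endswith("Z"):
        # Older Python versions do not accept a trailing "Z" in fromisoformat.
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are treated as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed

def updated_items(current, previous):
    """Return URLs present in both harvests whose lastmod has moved forward.

    Both arguments map URL -> lastmod string.
    """
    changed = set()
    for url, lastmod in current.items():
        old = previous.get(url)
        if old and lastmod and parse_lastmod(lastmod) > parse_lastmod(old):
            changed.add(url)
    return changed

if __name__ == "__main__":
    previous = {"a": "2019-05-01", "b": "2019-05-01"}
    current = {"a": "2019-05-01", "b": "2019-06-01T12:00:00Z", "c": "2019-06-01"}
    print(updated_items(current, previous))
```

This only works if the provider maintains lastmod; the element is optional in the sitemaps protocol, so a missing value says nothing either way.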

Updated Items alt approach

A simple alternative is to keep the sha256 value of a data graph and compare it on each harvest. However, a provider could change the layout or structure of a data graph without altering the actual content of the data. This could be addressed (I think) via RDF normalization (ref https://json-ld.github.io/normalization/spec/index.html ), which at the level of landing page RDF would be quick. The result would then be a useful input to a SHA value for comparison.
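To illustrate why hashing the raw bytes is fragile, here is a small sketch using only the Python standard library. Note that the canonical form below is just JSON with sorted keys and fixed separators; it is a stand-in for RDF normalization, not an implementation of it, and it would not catch blank node relabeling or differences between compacted and expanded JSON-LD. The function names and sample documents are hypothetical:

```python
import hashlib
import json

def raw_sha256(text):
    """Hash the document exactly as served; sensitive to layout changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def canonical_sha256(jsonld_text):
    """Hash a layout-independent serialization of the JSON document.

    Stand-in only: sorted-key JSON is NOT RDF dataset normalization.
    """
    doc = json.loads(jsonld_text)
    canonical = json.dumps(
        doc, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same content, different key order and whitespace.
DOC_A = '{"@type": "Dataset", "name": "Example"}'
DOC_B = '{\n  "name": "Example",\n  "@type": "Dataset"\n}'
# Actual content change.
DOC_C = '{"@type": "Dataset", "name": "Example v2"}'

if __name__ == "__main__":
    print(raw_sha256(DOC_A) == raw_sha256(DOC_B))              # layout change alters the raw hash
    print(canonical_sha256(DOC_A) == canonical_sha256(DOC_B))  # canonical hash is stable
    print(canonical_sha256(DOC_A) == canonical_sha256(DOC_C))  # content change is detected
```

In a real harvester the canonical string would presumably come from a JSON-LD library's normalization step, with only the hashing and comparison staying as shown.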

Delete items

One could simply say an item no longer in a sitemap is now deleted. Typically we would not expect this to happen once a data set reaches this stage but it could. There may be far more to this topic though.
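As a sketch, the simple rule is a set difference between what was seen before and what the sitemap lists now (the names and URLs are hypothetical). A failed or partial sitemap fetch would look like a mass delete under this rule, so results are probably best treated as candidates rather than confirmed deletions:

```python
def deleted_candidates(current_locs, seen):
    """Return URLs seen in an earlier harvest that the sitemap no longer lists."""
    return set(seen) - set(current_locs)

if __name__ == "__main__":
    seen = {"https://example.org/dataset/1", "https://example.org/dataset/2"}
    current = {"https://example.org/dataset/2", "https://example.org/dataset/3"}
    print(deleted_candidates(current, seen))
```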

fils commented 5 years ago

As a follow-up to this, another path is simply to use the SHA hash of a normalized [1] JSON-LD data graph.

This requires no special work or attention by a data provider and should work in the FAIR patterns for data and associated metadata edits.

[1] https://json-ld.github.io/normalization/spec/