LOD-Laundromat / lodlaundry.github.io

http://lodlaundromat.org

Find a strategy to keep datasets up to date #79

Open · RubenVerborgh opened this issue 9 years ago

RubenVerborgh commented 9 years ago

At the moment, it seems the LOD Laundromat settles for one version of a dataset. However, datasets change over time. Should the Laundromat also update its version? This requires a mechanism to find out whether a dataset has changed. (ETag? File size? Last-Modified?)
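For illustration, such a check could be done with a conditional HTTP request that sends the headers mentioned above back to the server. A minimal sketch in Python using the requests library (the function name and return convention are placeholders, not part of the Laundromat codebase; the discussion below also notes that these headers are not reliable on every server):

```python
import requests

def has_changed(url, last_etag=None, last_modified=None):
    """Ask the server whether the document changed since the last crawl,
    using a conditional request. Returns (changed, etag, last_modified)."""
    headers = {}
    if last_etag:
        headers["If-None-Match"] = last_etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    response = requests.head(url, headers=headers, allow_redirects=True)
    if response.status_code == 304:  # 304 Not Modified: nothing to re-crawl
        return False, last_etag, last_modified
    return True, response.headers.get("ETag"), response.headers.get("Last-Modified")

# Hypothetical usage:
# changed, etag, modified = has_changed("http://example.org/dataset.nt.gz")
```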

Furthermore, it should be decided what is done when a new version is detected.

Most importantly, the Laundromat can give dataset owners an incentive to clean their data. However, if the old version remains featured prominently, this can give the wrong impression: people may spot an outdated version of a dataset on the Laundromat and mistakenly assume its quality issues are still there, even though the original has since been updated.

wouterbeek commented 9 years ago

Thanks for making this feature suggestion! Laurens and I have pondered taking dataset dynamics into account for quite a while now, but we haven't been able to pick the topic up yet. There is a project out there that does track dataset dynamics (pardon the name): http://swse.deri.org/dyldo/

The way I see it, we need the following technical innovations in order to have the LOD Laundromat take dataset dynamics into account:

  1. Today the crawling metadata is not annotated with provenance information. This means that a new crawl of the same document will overwrite previous metadata. Obviously this has to change: for instance, the statement "Document d contains 102 warnings." should become the statement "Document d contained 102 warnings on dateTime t." (see the first sketch after this list).
  2. The wardrobe should use an updated SPARQL query that selects the metadata of the most recent crawl.
  3. The washing machine should be notified whenever a document gets updated on the Web. HTTP headers are known to be so unreliable that a notification service cannot be based on them alone. A more reliable way to detect that a document has changed may be to download it, take some unlikely-to-collide hash of its binary contents (this applies to archives and text files alike), and compare that hash to the hash of an earlier version. This is cheaper than a full file comparison. Maybe the hash could be computed within the download stream, i.e., without hitting the disk at any point in time? (See the hashing sketch after this list.)
  4. It would be interesting to parse the HTTP headers you mention and store them as part of the crawling metadata. This would allow us to check for the level of conformance to Web standards regarding this particular topic. (This would be part of a more granular operationalization of existing, high-level quality criteria such as the 5-star model.)
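To make points 1 and 2 concrete, one way to timestamp crawl metadata is to put each crawl's statements in their own named graph. A minimal sketch in Python with rdflib; the `ll:` vocabulary, the graph-naming scheme, and the function name are assumptions for illustration, not the Laundromat's actual schema:

```python
from datetime import datetime, timezone

from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import XSD

LL = Namespace("http://example.org/ll/")  # hypothetical metadata vocabulary

def record_crawl(store: Dataset, document: URIRef, warnings: int) -> None:
    """Write the metadata of one crawl into its own named graph, stamped
    with the crawl time, so a new crawl never overwrites an old one."""
    stamp = datetime.now(timezone.utc).isoformat()
    crawl = URIRef(f"{document}/crawl/{stamp}")  # one graph per crawl
    graph = store.graph(crawl)
    graph.add((document, LL.warnings, Literal(warnings)))
    graph.add((crawl, LL.crawledAt, Literal(stamp, datatype=XSD.dateTime)))

# The wardrobe (point 2) can then ask for the metadata of the latest crawl:
LATEST_CRAWL = """
PREFIX ll: <http://example.org/ll/>
SELECT ?doc ?warnings WHERE {
  GRAPH ?crawl {
    ?doc ll:warnings ?warnings .
    ?crawl ll:crawledAt ?t .
  }
}
ORDER BY DESC(?t)
LIMIT 1
"""
# results = store.query(LATEST_CRAWL)
```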
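And a sketch of the streaming hash from point 3, again in Python with requests; SHA-256 stands in for "some unlikely-to-collide hash", and the function name is illustrative:

```python
import hashlib
import requests

def stream_hash(url: str, chunk_size: int = 1 << 16) -> str:
    """Compute a SHA-256 digest of a document's payload bytes while
    streaming the download, so nothing is ever written to disk."""
    digest = hashlib.sha256()
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        for chunk in response.iter_content(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

# A document counts as changed when its digest differs from the stored one:
# if stream_hash(url) != previously_stored_digest: notify_washing_machine(url)
```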

All in all, I believe the above changes are realistic for a potential future version of the LOD Laundromat (see milestone).

PS: A more difficult, related topic is that of archiving the LOD Cloud. Saving multiple versions of the same data document over time requires (1) a technique for storing only the diff between consecutive files, and (2) a generous amount of disk space, which we currently lack.
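To illustrate part (1): for a line-based serialization such as N-Triples, consecutive versions could be reduced to a diff roughly like this (a sketch using Python's standard difflib; a real archive would likely use a dedicated RDF diff/patch format instead):

```python
import difflib

def version_delta(old_lines: list[str], new_lines: list[str]) -> str:
    """Return a unified diff between two consecutive versions of a
    line-based data document (e.g. an N-Triples file), so that only
    the changes need to be stored."""
    return "".join(difflib.unified_diff(old_lines, new_lines,
                                        fromfile="previous", tofile="current"))

# Hypothetical usage, reading two crawled versions from disk:
# with open("v1.nt") as f1, open("v2.nt") as f2:
#     delta = version_delta(f1.readlines(), f2.readlines())
```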