LOD-Laundromat / lodlaundry.github.io

http://lodlaundromat.org

Dealing with dirty data #36

Closed LaurensRietveld closed 10 years ago

LaurensRietveld commented 10 years ago

Right now the dirty data is stored on the file system. We should discuss this some more: why do we want this, for whom, in which cases is it needed, and how do we store it (zipped as well?)?

Some interesting observations: roughly 90% of all directories contain only dirty data, and for some reason not all of those files are named 'dirty' (some keep their regular filename instead, such as southampton-groups.rdf).

wouterbeek commented 10 years ago

Why/for whom/use cases: all unclear.

How stored: gzipped single file called dirty.gz.

The observation is correct: regular files remain in unpacked/unarchived directories until they are cleaned. After being cleaned they should appear in their own MD5 directory and be removed from the archive directory (if all is working fine, that is).

I'm using this in debugging now to check whether e.g. duplicates are correctly removed (comparing dirty.gz with clean.nt.gz). Whether it will be enabled for the next production (non-debug) crawl is something we should decide on. I would say "no", since there is no motivating use case yet.
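
A minimal sketch of what such a debugging check could look like, not the actual LOD Laundromat code. It assumes both dirty.gz and clean.nt.gz hold one statement per line (which only holds when the dirty input was already line-based, e.g. N-Triples); the function names `read_statements` and `duplicate_report` are hypothetical.

```python
import gzip
from collections import Counter


def read_statements(path):
    """Yield non-empty, stripped lines from a gzipped text file."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def duplicate_report(dirty_path, clean_path):
    """Summarise duplicates, treating each line as one statement (an assumption)."""
    dirty = Counter(read_statements(dirty_path))
    clean = Counter(read_statements(clean_path))
    return {
        "dirty_lines": sum(dirty.values()),
        "dirty_distinct": len(dirty),
        "clean_lines": sum(clean.values()),
        # Lines that still occur more than once after cleaning.
        "clean_duplicates": sorted(s for s, n in clean.items() if n > 1),
    }
```

Calling `duplicate_report("dirty.gz", "clean.nt.gz")` would then return an empty `clean_duplicates` list when deduplication worked. For dirty data in other serializations (such as RDF/XML) a line-based comparison like this would not be meaningful.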

LaurensRietveld commented 10 years ago

Yup, agreed that we should leave it be for now.