Closed LaurensRietveld closed 10 years ago
Why/for whom/use cases: all unclear.
How stored: gzipped single file called dirty.gz.
The observation is correct: regular files remain in unpacked/unarchived directories until they are cleaned. After being cleaned they should appear in their own MD5 directory and be removed from the archive directory (if all is working fine, that is).
I'm using this in debugging now to check whether e.g. duplicates are correctly removed (comparing dirty.gz with clean.nt.gz). Whether it will be enabled for the next non-debug/production/real crawl is something we should decide on. I would say "no" due to lack of motivation.
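For reference, a minimal sketch of the kind of check described above. It assumes both files are gzipped and line-oriented (one statement per line, as in N-Triples); if dirty.gz holds another serialization such as RDF/XML, only the clean-side count is meaningful. The function names `count_lines` and `duplicate_report` are made up for this example and are not part of the crawler.

```python
import gzip
from collections import Counter


def count_lines(path):
    """Count occurrences of each non-empty, stripped line in a gzipped text file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line:
                counts[line] += 1
    return counts


def duplicate_report(dirty_path, clean_path):
    """Report how many surplus (repeated) lines each file contains.

    Assumption: a "duplicate" is an exact repeat of a line after stripping
    whitespace. This does not detect semantically equal triples that are
    written differently.
    """
    dirty = count_lines(dirty_path)
    clean = count_lines(clean_path)
    return {
        "dirty_duplicates": sum(n - 1 for n in dirty.values() if n > 1),
        "clean_duplicates": sum(n - 1 for n in clean.values() if n > 1),
    }
```

If deduplication works, `clean_duplicates` should be 0 even when `dirty_duplicates` is not.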
Yup, agreed that we should leave it be for now.
Right now the dirty data is stored on the file system. We should discuss this some more: why do we want this, for whom, in which cases is it needed, and how should we store it (zipped as well)?
Some interesting observations: roughly 90% of all directories contain only dirty data, and somehow not all of the dirty files are named 'dirty' (some keep their regular filename instead, such as southampton-groups.rdf).
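A rough sketch of how such a survey could be reproduced. The layout is an assumption, not something confirmed in this thread: one subdirectory per dataset under a common root, a cleaned file named clean.nt.gz, and dirty files normally named 'dirty' or 'dirty.gz'. The function `survey` and both constants are hypothetical.

```python
from pathlib import Path

CLEAN_NAME = "clean.nt.gz"          # assumed name of the cleaned output
DIRTY_NAMES = {"dirty", "dirty.gz"}  # assumed expected names for dirty data


def survey(root):
    """Return (directories without a clean file, their unexpectedly named files)."""
    only_dirty = []
    odd_names = []
    for directory in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        files = sorted(f.name for f in directory.iterdir() if f.is_file())
        if not files or CLEAN_NAME in files:
            continue  # skip empty directories and those that were cleaned
        only_dirty.append(directory.name)
        # Record files that are neither clean output nor named like dirty data.
        odd_names.extend(
            f"{directory.name}/{name}" for name in files if name not in DIRTY_NAMES
        )
    return only_dirty, odd_names
```

Dividing `len(only_dirty)` by the total number of dataset directories gives the share of directories that hold only dirty data.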