UUDigitalHumanitieslab / kbkrant-harvest

working environment for harvesting the Databank Digitale Dagbladen
BSD 3-Clause "New" or "Revised" License

Unpack and sort by year #5

Open jgonggrijp opened 5 years ago

jgonggrijp commented 5 years ago

Currently, each newspaper is saved as a single tarball containing an uncompressed OCR file for each article as well as the gzipped metadata file. These tarballs are evenly spread over a hundred directories, using the last two digits of the newspaper ID as the grouping criterion (#2).
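For reference, the current grouping can be sketched roughly as below. This is only an illustration: the function name `bucket_path`, the root directory, the `.tar` suffix and the sample ID are assumptions, not taken from the harvesting scripts.

```python
from pathlib import PurePosixPath


def bucket_path(root: str, newspaper_id: str) -> PurePosixPath:
    """Return where a tarball would live under the two-digit layout.

    Hypothetical helper: the real ID format and file naming may differ.
    """
    bucket = newspaper_id[-2:]
    if len(bucket) != 2 or not bucket.isdigit():
        raise ValueError(f"ID does not end in two digits: {newspaper_id!r}")
    # One of a hundred directories, "00" through "99".
    return PurePosixPath(root) / bucket / f"{newspaper_id}.tar"
```

Because the bucket depends only on the ID, the distribution is even but carries no information about the publication date.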

We need to be able to find newspapers by year. The newspapers are not evenly spread over the years, so each year will have to be further divided into directories in order to prevent filesystem issues. If we're doing that anyway, we may as well organize by date within each year for added human-friendliness.

@BeritJanssen I propose to make combined month-date directories, so a flat list of up to 366 directories within each year (01-01, 01-02, ... 12-30, 12-31). This saves some filesystem overhead and should still be human-friendly enough, I think. Do you agree?
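A minimal sketch of what the proposed naming would look like, assuming each newspaper's publication date is available as a `datetime.date`. The names `date_dir` and `day_dirs` and the root directory are made up for illustration.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath


def date_dir(root: str, day: date) -> PurePosixPath:
    """Directory for one publication date: <root>/<year>/<month>-<day>."""
    return PurePosixPath(root) / f"{day.year:04d}" / f"{day.month:02d}-{day.day:02d}"


def day_dirs(year: int) -> list:
    """All month-day directory names that can occur within one year."""
    names = []
    day = date(year, 1, 1)
    while day.year == year:
        names.append(f"{day.month:02d}-{day.day:02d}")
        day += timedelta(days=1)
    return names
```

Zero-padding keeps the directories in chronological order under a plain `ls`. Note that leap years get one extra directory, 02-29.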

After reordering, it will likely still be desirable to be able to find newspapers by their ID. I'll leave a symlink from the old location to the new location for this purpose. Using ls -l, this will even make it possible to quickly look up the date for a given newspaper ID.
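The symlink idea could look something like this. It is a sketch under assumptions: the helper names are hypothetical, and it assumes the new location has the shape <root>/<year>/<month>-<day>/<newspaper ID>, so that the date can be read back from the link target.

```python
import os
from pathlib import Path


def link_old_to_new(old_path: Path, new_dir: Path) -> None:
    """Leave a relative symlink at the old (by-ID) location.

    A relative target keeps the links valid if the whole tree is moved.
    """
    old_path.parent.mkdir(parents=True, exist_ok=True)
    target = os.path.relpath(new_dir, start=old_path.parent)
    os.symlink(target, old_path, target_is_directory=True)


def date_from_link(old_path: Path) -> str:
    """Read the date back from the link target, much like ls -l would show it.

    Assumes the target ends in <year>/<month>-<day>/<newspaper ID>.
    """
    target = Path(os.readlink(old_path))
    month_day = target.parent.name
    year = target.parent.parent.name
    return f"{year}-{month_day}"
```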

It is also more practical to have the newspapers as directories rather than as tarballs. This does inflate the disk usage by a factor of about 5, but there is still space for this.
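Unpacking a single tarball into a directory could be sketched with Python's standard tarfile module as follows. The function name `unpack` is an assumption and this is not necessarily how the actual script does it. The gzipped metadata file inside the tarball is left as it is.

```python
import tarfile
from pathlib import Path


def unpack(tarball: Path, dest: Path) -> int:
    """Extract every member of `tarball` into `dest`; return the member count."""
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(tarball) as archive:
        members = archive.getmembers()
        archive.extractall(dest, members=members, **kwargs)
    return len(members)
```

The returned count can be compared against the number of files found in the destination as a cheap sanity check before the tarball is removed.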

jgonggrijp commented 5 years ago

I kicked off the script in a screen session. Fingers crossed!

jgonggrijp commented 5 years ago

Just checked, still running smoothly. A tally with find suggests that the script is about 4.5% done.

@BeritJanssen That puts the estimated finish time around Sunday 24 February, 12:30 CET. I didn't realize just sorting the files by date would take so much time!
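An estimate like this is a linear extrapolation from the fraction processed so far. A sketch, assuming the script progresses at a roughly constant rate; the function name and the example times are hypothetical and are not the actual start time of the run.

```python
from datetime import datetime


def estimate_finish(started: datetime, now: datetime, fraction_done: float) -> datetime:
    """Extrapolate the finish time, assuming a constant processing rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    elapsed = now - started
    # If `elapsed` covers `fraction_done` of the work, the total is elapsed / fraction.
    return started + elapsed / fraction_done


# Made-up example: a quarter done after one hour means four hours in total.
example = estimate_finish(datetime(2019, 1, 1, 0, 0), datetime(2019, 1, 1, 1, 0), 0.25)
```

The estimate is only as good as the constant-rate assumption. Since the newspapers are not evenly spread over the years, the real rate may drift.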