WebCuratorTool / webcurator

The root of the webcurator tool project, containing all modules needed to run a fully functional webcurator tool.
Apache License 2.0

Handling of .open files by harvest agent might lead to data loss #51

Open hannakoppelaar opened 2 years ago

hannakoppelaar commented 2 years ago

Once a harvest completes, the harvest agent retrieves WARC files, log files and reports from the Heritrix job directory. If it encounters a warc.open file (one that Heritrix is still writing to), it waits a fixed amount of time. After the timeout it simply moves on and ignores any remaining warc.open file: it only sends files with a .warc suffix to the store. Once it has finished sending files, it deletes the Heritrix job directory, including the warc.open file that it ignored (which in the meantime may even have become a regular WARC file), thus leading to data loss.
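
To make the failure mode concrete, here is a minimal sketch of a guard that could run right before the job directory is deleted. It is not taken from the harvest agent code: the class `JobDirCleanup` and its methods are hypothetical names, and treating `.warc.gz` as a finished WARC suffix is an assumption about the Heritrix configuration. The idea is to re-list the directory at deletion time and refuse to delete if any file still ends in `.open`, or if a finished WARC file appeared that was never sent to the store.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of the actual harvest agent. */
final class JobDirCleanup {

    /** File names that still carry the in-progress ".open" suffix. */
    public static List<String> stillOpen(List<String> fileNames) {
        return fileNames.stream()
                .filter(n -> n.endsWith(".open"))
                .sorted()
                .collect(Collectors.toList());
    }

    /** Finished WARC files (assumed suffixes) that were never sent to the store. */
    public static List<String> unsent(List<String> fileNames, Set<String> sent) {
        return fileNames.stream()
                .filter(n -> n.endsWith(".warc") || n.endsWith(".warc.gz"))
                .filter(n -> !sent.contains(n))
                .sorted()
                .collect(Collectors.toList());
    }

    /** Deletion is safe only if nothing is open and every finished WARC was sent. */
    public static boolean safeToDelete(List<String> fileNames, Set<String> sent) {
        return stillOpen(fileNames).isEmpty() && unsent(fileNames, sent).isEmpty();
    }

    /** Lists the job directory; call this again immediately before deleting. */
    public static List<String> list(Path jobDir) {
        try (Stream<Path> entries = Files.list(jobDir)) {
            return entries.map(p -> p.getFileName().toString())
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a check like `safeToDelete(list(jobDir), sentNames)`, a warc.open file that outlives the timeout, or one that was renamed to .warc after the transfer pass, would block the deletion instead of being silently removed. What the agent should do in that case (retry, send the late file, or leave the directory for manual inspection) is a separate design decision.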