commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WAT generator: do not fail on missing WARC-Filename in warcinfo record #23

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

(reported by @Xue-Alex, see the discussion in the Common Crawl group)

The WAT/WET generator fails on WARC files that contain a "warcinfo" record with no "WARC-Filename" header:

java.io.IOException: No Envelope.WARC-Header-Metadata.WARC-Filename found.
        at org.archive.extract.WATExtractorOutput.extractOrIO(WATExtractorOutput.java:152)
        at org.archive.extract.WATExtractorOutput.writeWARC(WATExtractorOutput.java:170)
        at org.archive.extract.WATExtractorOutput.output(WATExtractorOutput.java:85)

The first 60 WARC files of the CC-NEWS dataset (written Aug - Oct 2016) lack this field in their warcinfo records.

However, the WAT/WET extractor should not fail here, because the WARC-Filename header field is optional ("may be used in ‘warcinfo’ type records").

The WARC-Filename is used to fill the WARC-Target-URI header of the corresponding metadata record. Again, this field is optional (a "‘metadata’ record may have a WARC-Target-URI field"), so it seems natural to simply omit WARC-Target-URI for metadata records that correspond to a warcinfo record without WARC-Filename.
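
A minimal sketch of the proposed behaviour, not the actual ia-web-commons code: the class `MetadataHeaderSketch`, the method `metadataHeaders`, and the use of a plain `Map` for the record headers are assumptions made for illustration only. It shows the lookup treating WARC-Filename as optional instead of throwing an IOException.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataHeaderSketch {

    /**
     * Hypothetical helper: builds the headers of the metadata record that
     * describes a warcinfo record. WARC-Target-URI is only set when the
     * warcinfo record actually carries a WARC-Filename header.
     */
    public static Map<String, String> metadataHeaders(Map<String, String> warcinfoHeaders) {
        Map<String, String> out = new LinkedHashMap<>();
        out.put("WARC-Type", "metadata");

        // WARC-Filename is optional, so a missing value is not an error.
        String filename = warcinfoHeaders.get("WARC-Filename");
        if (filename != null && !filename.isEmpty()) {
            out.put("WARC-Target-URI", filename);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> withFilename = new LinkedHashMap<>();
        withFilename.put("WARC-Type", "warcinfo");
        withFilename.put("WARC-Filename", "example.warc.gz"); // made-up file name

        Map<String, String> withoutFilename = new LinkedHashMap<>();
        withoutFilename.put("WARC-Type", "warcinfo");

        // The first result contains WARC-Target-URI, the second does not.
        System.out.println(metadataHeaders(withFilename));
        System.out.println(metadataHeaders(withoutFilename));
    }
}
```

In the real extractor the equivalent change would presumably replace the failing lookup in WATExtractorOutput.writeWARC with an optional one; the details depend on how the envelope JSON is accessed there.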