The WAT/WET generator fails on WARC files that contain a "warcinfo" record without a "WARC-Filename" header:
java.io.IOException: No Envelope.WARC-Header-Metadata.WARC-Filename found.
at org.archive.extract.WATExtractorOutput.extractOrIO(WATExtractorOutput.java:152)
at org.archive.extract.WATExtractorOutput.writeWARC(WATExtractorOutput.java:170)
at org.archive.extract.WATExtractorOutput.output(WATExtractorOutput.java:85)
The first 60 WARC files of the CC-NEWS dataset (written Aug - Oct 2016) lack this field in their warcinfo records.
However, the WAT/WET extractor should not fail on them, because the WARC specification makes the WARC-Filename header field optional ("may be used in ‘warcinfo’ type records").
The WARC-Filename is used to fill the WARC-Target-URI header of the corresponding metadata record. Again, this field is optional (a "‘metadata’ record may have a WARC-Target-URI field"), so it seems natural to simply omit the WARC-Target-URI from metadata records that correspond to a warcinfo record without WARC-Filename.
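The proposed behaviour could look roughly like the sketch below. It is a standalone illustration, not the actual code of `WATExtractorOutput`: the class `MetadataHeaderSketch`, the method `buildMetadataHeaders`, and the use of a plain `Map` for the header fields are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: none of these names exist in the real extractor.
public class MetadataHeaderSketch {

    // Builds the header fields of the metadata record that corresponds to a
    // warcinfo record. WARC-Target-URI is only set when the warcinfo record
    // carries a WARC-Filename; otherwise it is left out instead of failing.
    static Map<String, String> buildMetadataHeaders(Map<String, String> warcinfoHeaders) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("WARC-Type", "metadata");

        String filename = warcinfoHeaders.get("WARC-Filename");
        if (filename != null && !filename.isEmpty()) {
            metadata.put("WARC-Target-URI", filename);
        }
        // No else branch: a missing WARC-Filename is not treated as an error.
        return metadata;
    }

    public static void main(String[] args) {
        // warcinfo record with a filename (the filename is a made-up example)
        Map<String, String> withFilename = new LinkedHashMap<>();
        withFilename.put("WARC-Type", "warcinfo");
        withFilename.put("WARC-Filename", "example.warc.gz");
        System.out.println(buildMetadataHeaders(withFilename));

        // warcinfo record without a filename, as in the early CC-NEWS files
        Map<String, String> withoutFilename = new LinkedHashMap<>();
        withoutFilename.put("WARC-Type", "warcinfo");
        System.out.println(buildMetadataHeaders(withoutFilename));
    }
}
```

In the real extractor the change would presumably go where the stack trace points, replacing the mandatory lookup of `Envelope.WARC-Header-Metadata.WARC-Filename` with an optional one.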
(reported by @Xue-Alex, see the discussion in the Common Crawl group)