commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

WARC file format improvement: add WARC-Truncated header #31

Closed: sebastian-nagel closed this issue 5 years ago

sebastian-nagel commented 5 years ago

The WARC standard recommends marking records that were truncated because of limits on content size or fetch time with a WARC-Truncated field. Add this field and record the reason for the truncation.

sebastian-nagel commented 5 years ago

Note that only WARC files in s3://commoncrawl/crawl-data/CC-NEWS/ written since May 16, 2019 contain the WARC-Truncated header.
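
For readers consuming these files, here is a minimal sketch of how the field could be checked on a record's header block. It is not the crawler's own code: the helper names (parse_warc_headers, truncation_reason) and the sample record are hypothetical, and it only parses headers that are already in memory as text. The reason values listed are the ones named in the WARC specification (length, time, disconnect, unspecified).

```python
from typing import Dict, Optional

# Reason values defined for WARC-Truncated in the WARC specification.
KNOWN_REASONS = {"length", "time", "disconnect", "unspecified"}

# Hypothetical header block of a truncated response record (illustration only).
SAMPLE_HEADERS = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: https://example.com/news/article\r\n"
    "WARC-Truncated: length\r\n"
    "Content-Length: 1048576\r\n"
    "\r\n"
)


def parse_warc_headers(header_block: str) -> Dict[str, str]:
    """Parse the named fields of a WARC header block into a dict.

    Field names are lower-cased because WARC field names are
    case-insensitive. The first line (the version line) is skipped.
    """
    headers: Dict[str, str] = {}
    for line in header_block.strip().splitlines()[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def truncation_reason(header_block: str) -> Optional[str]:
    """Return the WARC-Truncated value, or None if the field is absent."""
    return parse_warc_headers(header_block).get("warc-truncated")


reason = truncation_reason(SAMPLE_HEADERS)
if reason is not None:
    label = reason if reason in KNOWN_REASONS else "unrecognized value: " + reason
    print("record truncated:", label)
```

Because of the date cutoff noted above, a missing field in files written before May 16, 2019 does not show that a record is complete; it only means the reason was not recorded.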