commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

WARC file format fix: mask HTTP header fields Content-Encoding and Transfer-Encoding, adjust Content-Length #30

Closed: sebastian-nagel closed this issue 5 years ago

sebastian-nagel commented 5 years ago

The CC-NEWS WARC files contain the literal values of the HTTP header fields Content-Encoding, Transfer-Encoding and Content-Length, although the payload is stored unchunked and uncompressed. As a result, these header values no longer describe the stored payload.

Thanks, @wumpus, for detecting this!
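
For illustration only, the masking described in this issue could be sketched in Python roughly as follows. This is not the code used by news-crawl: the function name `mask_http_headers` and the `X-Crawler-` prefix used to mask the two fields are assumptions made for this example, and the issue itself does not state how the masked fields are named.

```python
# Minimal sketch, not the news-crawl implementation.
# ASSUMPTION: masked fields are renamed with a prefix; the prefix below
# is chosen for illustration and is not confirmed by this issue.
MASK_PREFIX = "X-Crawler-"
MASKED_FIELDS = {"content-encoding", "transfer-encoding"}


def mask_http_headers(header_lines, payload_length):
    """Return header lines that match an unchunked, uncompressed payload.

    header_lines: list of str, status line first, then "Name: value" lines.
    payload_length: length in bytes of the payload as actually stored.
    """
    result = []
    for index, line in enumerate(header_lines):
        name, sep, value = line.partition(":")
        if index == 0 or not sep:
            # Status line or malformed line: keep unchanged.
            result.append(line)
            continue
        key = name.strip().lower()
        if key in MASKED_FIELDS:
            # Keep the original value, but under a masked field name.
            result.append(f"{MASK_PREFIX}{name.strip()}:{value}")
        elif key == "content-length":
            # Adjust the length to the stored payload.
            result.append(f"{name.strip()}: {payload_length}")
        else:
            result.append(line)
    return result


if __name__ == "__main__":
    original = [
        "HTTP/1.1 200 OK",
        "Content-Type: text/html",
        "Content-Encoding: gzip",
        "Transfer-Encoding: chunked",
        "Content-Length: 120",
    ]
    for line in mask_http_headers(original, 4096):
        print(line)
```

With headers rewritten this way, a WARC reader that trusts Content-Encoding or Transfer-Encoding would not try to decompress or de-chunk a payload that is already stored decoded.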

sebastian-nagel commented 5 years ago

Note that only WARC files in s3://commoncrawl/crawl-data/CC-NEWS/ written since May 16, 2019 contain the masked headers.