commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0
316 stars, 34 forks

WARC file format fix: add WARC-IP-Address #29

Closed by sebastian-nagel 5 years ago

sebastian-nagel commented 5 years ago

CC-NEWS response records should include the remote target IP address in the "WARC-IP-Address" header field. Thanks, @wumpus, for detecting this!

sebastian-nagel commented 5 years ago

Note that only WARC files in s3://commoncrawl/crawl-data/CC-NEWS/ written since May 16, 2019 contain the WARC-IP-Address header.
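For anyone who needs to tell older files from newer ones, here is a minimal sketch of how one might check which response records lack the header. It is not part of news-crawl: the function names `read_warc_headers` and `responses_missing_ip` are made up for illustration, and the parser assumes well-formed records with a correct Content-Length. It uses only the Python standard library.

```python
import gzip


def read_warc_headers(stream):
    """Yield a dict of lower-cased WARC header fields for each record.

    `stream` is a binary file-like object positioned at the start of a
    WARC file (already decompressed, e.g. opened via gzip.open).
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body so that payload lines are never parsed as headers.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def responses_missing_ip(stream):
    """Return the target URIs of response records without WARC-IP-Address."""
    return [
        h.get("warc-target-uri")
        for h in read_warc_headers(stream)
        if h.get("warc-type") == "response" and "warc-ip-address" not in h
    ]


def check_file(path):
    """Hypothetical helper: check a local, gzip-compressed WARC file."""
    with gzip.open(path, "rb") as stream:
        return responses_missing_ip(stream)
```

Calling `check_file` on a locally downloaded CC-NEWS file (the path is whatever you saved it as) should return an empty list for files written after the fix, and the URIs of all response records for older files.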