commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Add HTTP protocol version to HTTP request message #34

Open: sebastian-nagel opened this issue 4 years ago

sebastian-nagel commented 4 years ago

The request records in the CC-NEWS WARC files lack the HTTP protocol version:

GET /path 

instead of

GET /path HTTP/1.1

This causes some WARC parsers to fail when processing the files; see https://groups.google.com/d/msg/common-crawl/hsb90GHq6to/Lv-9-nHAAQAJ.
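For anyone who wants to check whether a given request record is affected, here is a minimal sketch in Python. It only tests the first line of the request against the `method SP request-target SP HTTP-version` shape from RFC 7230. The function name `has_http_version` and the regular expression are illustrative assumptions, not part of StormCrawler or of any WARC library, and the method pattern is simplified to uppercase letters.

```python
import re

# Simplified request-line pattern: method SP request-target SP HTTP-version.
# Assumption: methods are uppercase ASCII letters (true for GET, POST, HEAD, ...).
REQUEST_LINE = re.compile(
    r"^(?P<method>[A-Z]+) (?P<target>\S+) (?P<version>HTTP/\d\.\d)$"
)


def has_http_version(request_line: str) -> bool:
    """Return True if the request line ends with an HTTP version token."""
    # Drop the trailing CRLF only; a trailing space is kept so that
    # a truncated line such as "GET /path " is still reported as incomplete.
    return REQUEST_LINE.match(request_line.rstrip("\r\n")) is not None
```

Called on the two lines quoted above, `has_http_version("GET /path ")` is false and `has_http_version("GET /path HTTP/1.1")` is true.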

sebastian-nagel commented 4 years ago

The fix in StormCrawler (DigitalPebble/storm-crawler#775) has been deployed to production. WARC files now contain the HTTP version in the request message.