commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Do not use "http/2" protocol version in HTTP headers in WARC files #42

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

340 WARC files of the news crawl data set, covering 2020-09-12 through 2020-10-04, were captured using HTTP/2 after a Java security upgrade added ALPN support and therefore made HTTP/2 possible. The crawler started to use HTTP/2 after an automatic restart.

The affected WARC files may cause WARC readers (e.g. jwarc) to fail while parsing the HTTP headers:

To address the issue:

Affected files:

s3://commoncrawl/crawl-data/CC-NEWS/2020/09/CC-NEWS-20200912083952-00000.warc.gz
...
s3://commoncrawl/crawl-data/CC-NEWS/2020/10/CC-NEWS-20201004110027-00339.warc.gz

More than 80% of the records are captured using HTTP/2.
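
For readers who need to work with the affected files in the meantime, here is a minimal sketch of how a consumer might detect and rewrite an HTTP/2-style status line before handing the headers to an HTTP/1.x parser. It assumes the problematic status lines look like "HTTP/2 200" or "HTTP/2.0 200"; the exact form written to the WARC files is not shown in this issue. The function names are hypothetical, and this is not the fix implemented in StormCrawler.

```python
import re

# Status line shape that an HTTP/1.x parser expects, e.g. "HTTP/1.1 200 OK".
HTTP1_STATUS = re.compile(rb"^HTTP/1\.[01] \d{3}( .*)?$")

# ASSUMPTION: problematic status lines look like "HTTP/2 200" or
# "HTTP/2.0 200 ...". The exact form in the affected files may differ.
HTTP2_STATUS = re.compile(rb"^HTTP/2(?:\.0)? (\d{3})(?: (.*))?$", re.IGNORECASE)


def is_http1_status_line(line: bytes) -> bool:
    """Return True if the line parses as an HTTP/1.0 or HTTP/1.1 status line."""
    return HTTP1_STATUS.match(line.rstrip(b"\r\n")) is not None


def normalize_status_line(line: bytes) -> bytes:
    """Rewrite an HTTP/2-style status line as HTTP/1.1 (hypothetical helper).

    Lines that do not match the assumed HTTP/2 shape are returned unchanged,
    apart from stripping the trailing line terminator.
    """
    stripped = line.rstrip(b"\r\n")
    match = HTTP2_STATUS.match(stripped)
    if match is None:
        return stripped
    code = match.group(1)
    reason = match.group(2) or b""  # HTTP/2 carries no reason phrase
    # HTTP/1.1 requires a space after the status code even if the
    # reason phrase is empty.
    return b"HTTP/1.1 " + code + b" " + reason
```

A reader could call `normalize_status_line` on the first line of each HTTP response payload before parsing. Rewriting the version this way discards the information that the capture used HTTP/2, so a consumer that cares about it would need to record it separately.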

jnioche commented 11 months ago

This will be fixed when the NewsCrawler is ported to StormCrawler 2.x. The fix is available since StormCrawler 2.7.

sebastian-nagel commented 3 months ago

The fix is available since StormCrawler 2.7.

See apache/incubator-stormcrawler#1010