commoncrawl / nutch

Common Crawl fork of Apache Nutch
Apache License 2.0

[WARC writer / protocol-okhttp] WARC-Truncated header issues and improvements #10

Closed sebastian-nagel closed 5 years ago

sebastian-nagel commented 5 years ago

There are some oddities in how truncated captures are recorded in WARC files. See also Henry Thompson's report and the discussion in the Common Crawl user group.

sebastian-nagel commented 5 years ago

For analysis and verification see

sebastian-nagel commented 5 years ago

Implemented and fixed for the August 2019 crawl (CC-MAIN-2019-35). Solution verified on 100 randomly selected WARC files.
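
For readers who want to run a similar check on their own files, here is a minimal sketch that tallies the `WARC-Truncated` header across the response records of a WARC stream. It is not the analysis code used for the crawl. The function names `iter_warc_headers` and `count_truncated` and the `(not truncated)` label are hypothetical, and the parser is a simplified stdlib-only reader that assumes well-formed records with a correct `Content-Length`.

```python
import gzip
import io
from collections import Counter


def iter_warc_headers(stream):
    """Yield the WARC header fields (lower-cased names) of each record.

    Simplified reader: assumes well-formed records whose Content-Length
    matches the record body. Hypothetical helper, not a full WARC parser.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip the record body so the next readline() starts at the next record.
        stream.read(int(headers.get("content-length", 0)))
        yield headers


def count_truncated(stream):
    """Count WARC-Truncated values over all response records in the stream."""
    counts = Counter()
    for headers in iter_warc_headers(stream):
        if headers.get("warc-type") == "response":
            counts[headers.get("warc-truncated", "(not truncated)")] += 1
    return counts
```

For a gzip-compressed file, something like `with gzip.open(path, "rb") as f: print(count_truncated(f))` should work, since Python's `gzip` module reads concatenated gzip members as one stream.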

tfmorris commented 8 months ago

> Implemented and fixed for August 2019 crawl (CC-MAIN-2019-35). Solution verified on 100 randomly selected WARC files.

The above notebook is now here