sebastian-nagel closed this issue 5 years ago
For analysis and verification see
Implemented and fixed for the August 2019 crawl (CC-MAIN-2019-35). The solution was verified on 100 randomly selected WARC files.
The above notebook is now here
There are some oddities in how truncated captures are recorded in WARC files. See also Henry Thompson's report and the discussion in the Common Crawl user group.
Content-Encoding: gzip