commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WET files may include binary content if HTTP Content-Type header erroneously indicates HTML #26

Closed by sebastian-nagel 1 year ago

sebastian-nagel commented 1 year ago

WET files may include binary content if the HTTP Content-Type header of a WARC response record indicates that the content is HTML but it actually isn't:
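To illustrate the kind of mismatch described here, the following is a minimal sketch of a check that flags a payload whose Content-Type header claims HTML although the bytes look binary. It is not the fix implemented in ia-web-commons: the class `BinaryContentSniffer`, its methods, the 512-byte sniff window, and the 10% control-character threshold are all hypothetical choices made for this example.

```java
import java.util.Locale;

/**
 * Hypothetical helper (not part of ia-web-commons): guesses whether a payload
 * declared as HTML is in fact binary, using a simple byte-level heuristic.
 */
final class BinaryContentSniffer {

    /** Number of leading bytes to inspect (arbitrary choice for this sketch). */
    private static final int SNIFF_LENGTH = 512;

    /** Share of control bytes above which the payload counts as binary (assumed value). */
    private static final double MAX_CONTROL_RATIO = 0.1;

    private BinaryContentSniffer() {}

    /** Returns true if the Content-Type value declares HTML or XHTML. */
    static boolean declaresHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String ct = contentType.trim().toLowerCase(Locale.ROOT);
        return ct.startsWith("text/html") || ct.startsWith("application/xhtml+xml");
    }

    /** Returns true if the leading bytes of the payload look like binary data. */
    static boolean looksBinary(byte[] payload) {
        int n = Math.min(payload.length, SNIFF_LENGTH);
        if (n == 0) {
            return false;
        }
        int control = 0;
        for (int i = 0; i < n; i++) {
            int b = payload[i] & 0xFF;
            if (b == 0) {
                // A NUL byte is a strong hint of binary data.
                // Caveat: this also flags UTF-16 encoded text.
                return true;
            }
            // Count control characters other than tab, LF, FF and CR.
            if (b < 0x20 && b != '\t' && b != '\n' && b != '\f' && b != '\r') {
                control++;
            }
        }
        return (double) control / n > MAX_CONTROL_RATIO;
    }

    /** Returns true if the header claims HTML but the payload looks binary. */
    static boolean isMislabeledHtml(String contentType, byte[] payload) {
        return declaresHtml(contentType) && looksBinary(payload);
    }

    public static void main(String[] args) {
        // First bytes of a PNG file, served with a wrong Content-Type header.
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A, 0, 0, 0, 0x0D};
        System.out.println(isMislabeledHtml("text/html; charset=UTF-8", png));
    }
}
```

A production text extractor would more likely rely on a dedicated content-detection library than on a hand-written heuristic like this one; the sketch only shows why trusting the header alone is not sufficient.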

sebastian-nagel commented 1 year ago

Implemented the following improvements to text extraction (WET files and anchor texts in WAT files):

These changes are in effect for the currently running September 2022 crawl (CC-MAIN-2022-40).