WET files may include binary content if the HTTP Content-Type header of a WARC response record indicates that the content is HTML but it actually isn't:
Implemented the following improvements to text extraction (WET files and anchor texts in WAT files):
use the WARC header WARC-Identified-Payload-Type (if available) to identify HTML content to be parsed for link and text extraction
replace ASCII control characters, line breaks and some Unicode white space characters with U+0020 during text extraction
reuse the text extraction applied to the WET text payload for anchor text extraction as well: this also improves the spacing when anchor texts include HTML elements (cf. #13)
increase max. anchor text length (100 -> 128 characters)
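
The changes listed above could be sketched roughly as follows in Python. This is an illustration only, not the extractor's actual code: the function names, the set of MIME types treated as HTML, the exact set of Unicode white space characters, the one-for-one replacement, and the truncation by slicing are all assumptions made for the example.

```python
import re

# Assumption: the source only says "some Unicode white space characters";
# this character class is an illustrative choice, not the real list.
_CONTROL_OR_SPACE = re.compile(
    r"[\x00-\x1f\x7f\u0085\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]"
)

# Assumption: MIME types treated as HTML in this sketch.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(warc_headers, http_content_type):
    """Prefer WARC-Identified-Payload-Type, fall back to the HTTP Content-Type."""
    identified = warc_headers.get("WARC-Identified-Payload-Type")
    mime = identified if identified else http_content_type
    if not mime:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    return mime.split(";", 1)[0].strip().lower() in HTML_TYPES


def normalize_text(text):
    """Replace each control or white space character with a single U+0020."""
    return _CONTROL_OR_SPACE.sub(" ", text)


def clip_anchor_text(text, max_len=128):
    """Normalize an anchor text and cut it to at most max_len characters."""
    return normalize_text(text)[:max_len]
```

With these hypothetical helpers, a record whose HTTP Content-Type claims `text/html` but whose identified payload type is `application/pdf` is skipped, which is the case that otherwise lets binary content end up in WET files.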
These changes are effective for the running September 2022 crawl (CC-MAIN-2022-40).