Add encoding detection to WET text extraction

commoncrawl / ia-web-commons

Web archiving utility library

Apache License 2.0

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

The WET text extraction always assumes UTF-8; it should rely on robust charset detection instead. See the discussions ".wet file encoding" and "problem with East European encodings in WET files".
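To illustrate the problem, here is a minimal sketch of one way to avoid blindly assuming UTF-8, using only the Java standard library. It is not the code in ia-web-commons: the class `CharsetFallbackDecoder`, its methods, and the `declaredCharset` parameter (standing in for a charset taken from an HTTP header or HTML meta tag) are hypothetical names chosen for this example. The sketch decodes strictly as UTF-8 first and falls back to the declared charset only when the bytes are not valid UTF-8. A real detector would also inspect byte statistics, for instance via a library such as ICU4J's `CharsetDetector`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of ia-web-commons.
class CharsetFallbackDecoder {

    /** Decodes strictly; returns null if the bytes are not valid in the charset. */
    static String decodeStrict(byte[] bytes, Charset charset) {
        try {
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    /**
     * Tries strict UTF-8 first, then the declared charset (assumed to come
     * from an HTTP header or meta tag), then ISO-8859-1 as a last resort.
     */
    static String decode(byte[] bytes, String declaredCharset) {
        String text = decodeStrict(bytes, StandardCharsets.UTF_8);
        if (text != null) {
            return text;
        }
        if (declaredCharset != null) {
            try {
                // Lenient decode: malformed input becomes U+FFFD.
                return new String(bytes, Charset.forName(declaredCharset));
            } catch (IllegalArgumentException e) {
                // Unknown or illegal charset name: fall through.
            }
        }
        // ISO-8859-1 maps every byte to a character, so this never fails.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Polish sample text, written with Unicode escapes.
        String sample = "Za\u017c\u00f3\u0142\u0107";
        byte[] latin2 = sample.getBytes(Charset.forName("ISO-8859-2"));
        System.out.println(sample.equals(decode(latin2, "ISO-8859-2")));
    }
}
```

The strict UTF-8 pass matters because a lenient decoder silently replaces invalid bytes with U+FFFD, which is how text in East European single-byte encodings ends up garbled when UTF-8 is assumed.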

sebastian-nagel commented 7 years ago

Tested with a set of sample WARC files (wet_encoding_test.zip) covering Japanese, Polish, Czech, Hungarian, Russian, Turkish, and German pages in various encodings.