Closed (sebastian-nagel closed this 7 years ago)
The WET text extraction always assumes UTF-8; it should rely on robust charset detection instead. See the discussions ".wet file encoding" and "problem with East European encodings in WET files".
Tested with a set of sample WARC files (wet_encoding_test.zip) covering Japanese, Polish, Czech, Hungarian, Russian, Turkish, and German pages in various encodings.
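
To illustrate the failure mode and one possible fallback order, here is a minimal Python sketch. It is not the code used in this change: the function name `decode_payload`, its parameters, and the `windows-1252` default are all hypothetical, and it only checks a byte order mark, a declared charset (for example from the HTTP `Content-Type` header), and strict UTF-8. A robust implementation would add a statistical charset detector, since a wrongly declared single-byte charset decodes without raising an error.

```python
import codecs
from typing import Optional, Tuple

# Byte order marks checked first (assumption: UTF-8 and UTF-16 only).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_payload(
    payload: bytes,
    declared: Optional[str] = None,
    fallback: str = "windows-1252",
) -> Tuple[str, str]:
    """Decode a record payload; return (text, charset label that was used)."""
    # 1. A byte order mark is the strongest signal.
    for bom, name in _BOMS:
        if payload.startswith(bom):
            return payload.decode(name, errors="replace"), name

    # 2. Try the declared charset strictly; skip it if unknown or if it fails.
    if declared:
        try:
            return payload.decode(declared), declared.lower()
        except (LookupError, UnicodeDecodeError):
            pass

    # 3. Strict UTF-8 rarely accepts text in another encoding by accident.
    try:
        return payload.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 4. Last resort: never fail, replace undecodable bytes.
    return payload.decode(fallback, errors="replace"), fallback


if __name__ == "__main__":
    polish = "Zażółć gęślą jaźń"
    raw = polish.encode("iso-8859-2")
    # Assuming UTF-8 unconditionally cannot decode these bytes.
    print(decode_payload(raw, declared="ISO-8859-2"))
    print(decode_payload(raw))
```

Without a declared charset the Polish sample falls through to the fallback and comes out as mojibake, which is the case where real charset detection is needed.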