GoogleCodeExporter closed this issue 9 years ago
It seems that someone typed a non-UTF-8 character into a Wiktionary article,
and the database dump application has not cleaned it up. That is, latest.xml
contains a 4-byte sequence that does not follow the expected UTF-8 pattern
(11110xxx 10xxxxxx 10xxxxxx 10xxxxxx). The best solution is to remove these
invalid bytes.
A quick and dirty hack might be
http://stackoverflow.com/questions/12999651/how-to-remove-non-utf-8-characters-from-text-file
A more elaborate idea is
http://www.mkyong.com/java/sax-error-malformedbytesequenceexception-invalid-byte-1-of-1-byte-utf-8-sequence/
I don't know which of these helps, or whether there is an easy way of stripping
non-UTF-8 characters from the input file. It would be nice if you could report
back and potentially submit a patch, since I suspect that other users will run
into the same issue.
Original comment by chmeyer.de
on 21 May 2014 at 10:03
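For readers who hit the same error, the "remove the invalid bytes" idea from the comment above could be sketched in Java roughly as follows. This is only an illustration, not part of JWKTL: the class name Utf8Cleaner and the method clean are made up for this example. It uses the standard CharsetDecoder with CodingErrorAction.IGNORE, so malformed byte sequences are silently dropped while valid UTF-8 (including valid 4-byte sequences) is copied through. It does not remove characters that are valid UTF-8 but illegal in XML, so it may not cover every parser error.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of JWKTL.
class Utf8Cleaner {

    /**
     * Copies in to out, silently dropping byte sequences that are not
     * valid UTF-8. Both streams are closed when the copy finishes.
     */
    static void clean(InputStream in, OutputStream out) throws IOException {
        // IGNORE makes the decoder skip malformed input instead of
        // throwing or substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        try (Reader reader = new BufferedReader(new InputStreamReader(in, decoder));
             Writer writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            char[] buffer = new char[8192];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }

    // Usage (file names are placeholders):
    //   java Utf8Cleaner latest.xml latest-clean.xml
    public static void main(String[] args) throws IOException {
        clean(Files.newInputStream(Paths.get(args[0])),
              Files.newOutputStream(Paths.get(args[1])));
    }
}
```

The cleaned copy, rather than the original dump, would then be passed to the parser. Since the dump is large, the sketch streams it through a fixed-size buffer instead of loading it into memory.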
Probably fixed with jwktl-1.0.1. Please try again using the new version. Note
that, AFAIK, the UTF-8 bug still exists, but either the current Xerces version
ignores it or the current XML dump has been fixed w.r.t. this issue.
Original comment by chmeyer.de
on 30 Sep 2014 at 12:27
Original issue reported on code.google.com by
n...@fbk.eu
on 21 May 2014 at 1:34