dkpro / dkpro-jwktl

Java Wiktionary Library
http://dkpro.org/dkpro-jwktl/
Apache License 2.0

Replace Xerces #28

Closed: jberkel closed this 8 years ago

jberkel commented 8 years ago

I'm not sure if I've run into bug #6 again, but the current Wiktionary dump (20151102) does not parse correctly. It fails with:

org.xml.sax.SAXParseException; lineNumber: 99849520; columnNumber: 47; Invalid byte 2 of 4-byte UTF-8 sequence.
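For readers unfamiliar with this class of error: in UTF-8, the second to fourth bytes of a 4-byte sequence must be continuation bytes. The following is a minimal sketch, independent of Xerces and JWKTL, that uses the JDK's strict `CharsetDecoder` to tell a well-formed 4-byte sequence from a malformed one. The class name `Utf8Check` and the sample bytes are made up for illustration; this does not reproduce the dump's actual content or the Xerces code path.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JWKTL or Xerces.
public class Utf8Check {

    /** Returns true if the given bytes decode as strict UTF-8. */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+1F600 encoded as a well-formed 4-byte sequence.
        byte[] valid = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};
        // Same lead byte, but byte 2 (0x41) is not a continuation byte.
        byte[] broken = {(byte) 0xF0, (byte) 0x41, (byte) 0x98, (byte) 0x80};

        System.out.println("valid:  " + isValidUtf8(valid));
        System.out.println("broken: " + isValidUtf8(broken));
    }
}
```

If the dump's bytes pass a strict check like this while the parser still rejects them, the fault is more likely in the parser than in the data.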

After some investigation, I think I fixed the underlying issue in Xerces (XERCES-J-1668).

However, the UTF-8 handling there is quite messy and lacks good test coverage. Unless there are objections, I propose replacing Xerces with something else.

Java's default XMLStreamReader could be a good option, and it will probably be more performant as well.
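As a rough illustration of what a streaming reader based on `XMLStreamReader` could look like, here is a minimal sketch. It is not the JWKTL implementation: the class name `StaxTitleReader`, the method `readTitles`, and the tiny inline document are assumptions for the example, with element names loosely modelled on a MediaWiki dump.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not taken from JWKTL.
public class StaxTitleReader {

    /** Collects the text content of every <title> element in the stream. */
    public static List<String> readTitles(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        try {
            // Pull events one at a time; the document is never loaded as a whole.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        // The second title is U+1F600, which needs four bytes in UTF-8.
        String xml = "<mediawiki><page><title>dog</title></page>"
                + "<page><title>\uD83D\uDE00</title></page></mediawiki>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(readTitles(in));
    }
}
```

Because the reader is pull-based, memory use stays roughly constant regardless of dump size, which matters for multi-gigabyte Wiktionary dumps.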

chmeyer commented 8 years ago

I guess removing Xerces fixed your parsing issue. Which Wiktionary language version did you use for testing? It would be good if the modified JWKTL version could handle at least the latest English and German XML dumps.

jberkel commented 8 years ago

It did fix the problem. I tested with the English dump.

Running Java 1.8.0_45 on OS X.

I'll do a test run with the German XML dump and report back.

jberkel commented 8 years ago

It works with both. I added some integration tests that do a full import (not run by default, since they are very slow).

jberkel commented 8 years ago

Closed with #29