Closed by jberkel 8 years ago
I guess removing Xerces fixed your parsing issue. Which Wiktionary language version did you use for testing? It would be good if the modified JWKTL version could handle at least the latest English and German XML dumps.
It did fix the problem; I tested with the English dump.
Running Java 1.8.0_45 on OS X.
I'll do a test run with the German XML dump and report back.
It works with both. I also added some integration tests that do a full import (not run by default, since they are very slow).
Closed with #29
I'm not sure if I've run into bug #6 again, but the current Wiktionary dump (20151102) does not parse correctly, failing with
After some investigation, I think I fixed the underlying issue in Xerces (XERCES-J-1668).
However, the UTF-8 handling there is quite messy and lacks good test coverage. I propose replacing Xerces with something else unless there are objections.
Java's default XMLStreamReader could be a good option, and it will probably be more performant as well.
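To give an idea of what the XMLStreamReader route could look like, here is a minimal sketch using only the JDK's built-in StAX API. It is not the JWKTL implementation: the class name `StaxTitleSketch`, the `readTitles` helper, and the tiny inline sample document are made up for illustration, and it assumes the dump wraps each entry title in a `<title>` element.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical sketch, not part of JWKTL.
public class StaxTitleSketch {

    // Streams through the XML and collects the text of every <title> element.
    public static List<String> readTitles(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // A dump needs neither DTDs nor external entities, so switch them off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        // The reader takes the encoding from the XML declaration (UTF-8 if absent).
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    titles.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Made-up sample input; a real run would pass a stream over the dump file.
        String xml = "<mediawiki>"
                + "<page><title>W\u00f6rterbuch</title></page>"
                + "<page><title>dictionary</title></page>"
                + "</mediawiki>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String title : readTitles(in)) {
            System.out.println(title);
        }
    }
}
```

Since the reader pulls one event at a time, memory use should stay flat regardless of dump size, but I have not measured this against Xerces.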