DARIAH-DE / DARIAH-DKPro-Wrapper

Wrapper for DKPro Core to extract linguistic information from books.
http://dariah-de.github.io/DARIAH-DKPro-Wrapper
Apache License 2.0

Hyphenation parser fails on French patterns #27

Status: Closed (thvitt closed this 7 years ago)

thvitt commented 7 years ago

When loading the French hyphenation file, the parser fails with this stack trace:

Caused by: java.lang.ArrayIndexOutOfBoundsException: 339
        at net.davidashen.text.Hyphenator$Scanner.read(Hyphenator.java:448) ~[ddw-0.4.7-SNAPSHOT.jar:?]
        at net.davidashen.text.Hyphenator$Scanner.cc2pat(Hyphenator.java:477) ~[ddw-0.4.7-SNAPSHOT.jar:?]
        at net.davidashen.text.Hyphenator$Scanner.getSym(Hyphenator.java:386) ~[ddw-0.4.7-SNAPSHOT.jar:?]
        at net.davidashen.text.Hyphenator.loadTable(Hyphenator.java:56) ~[ddw-0.4.7-SNAPSHOT.jar:?]
        at de.tudarmstadt.ukp.dariah.annotator.HyphenationAnnotator.initHyphenator(HyphenationAnnotator.java:140) ~[ddw-0.4.7-SNAPSHOT.jar:?]
        at de.tudarmstadt.ukp.dariah.annotator.HyphenationAnnotator.process(HyphenationAnnotator.java:152) ~[ddw-0.4.7-SNAPSHOT.jar:?]
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48) ~[ddw-0.4.7-SNAPSHOT.jar:?]
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385) ~[ddw-0.4.7-SNAPSHOT.jar:?]
        ... 8 more

The relevant line fails with cc == 339 == 0x0153 (œ), a character that is not part of ISO-8859-1.

thvitt commented 7 years ago

Certain hyphenation tables use encoding other than ISO-8859-1. To facilitate translation from that particular encoding to UCS, a list of codes and their unicode values can be passed to the hyphenator. See ruhyphal.tex, koicodes.txt for an example of a KOI8-R-encoded hyphenation table and a list of codes. [TeXHyph-J]

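HyphenationAnnotator initializes this to a 256-byte 1:1 table, but we ship UTF-8 encoded files, so any character with a code above 255 (such as œ, 339) falls outside the table and triggers the ArrayIndexOutOfBoundsException above.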

We should probably just ignore that table by default.
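
To illustrate the failure mode and the proposed behaviour, here is a minimal standalone sketch. It does not use the TeXHyph-J classes: `CodeTableSketch`, `identityTable`, `translateUnchecked` and `translate` are hypothetical names made up for this example, and the assumption is that the scanner indexes the code table directly with the character value it has read. The guarded variant simply passes through any code the table does not cover, which is one possible reading of "ignore that table".

```java
// Hypothetical stand-in for a code table lookup; this is not the TeXHyph-J source.
final class CodeTableSketch {

    // Builds a 1:1 table for codes 0..255, as described for the annotator.
    static int[] identityTable() {
        int[] table = new int[256];
        for (int i = 0; i < table.length; i++) {
            table[i] = i;
        }
        return table;
    }

    // Unchecked lookup: throws ArrayIndexOutOfBoundsException when cc >= table.length.
    static int translateUnchecked(int[] table, int cc) {
        return table[cc];
    }

    // Guarded lookup: with no table, or a code outside it, the code passes through unchanged.
    static int translate(int[] table, int cc) {
        if (table == null || cc < 0 || cc >= table.length) {
            return cc;
        }
        return table[cc];
    }

    public static void main(String[] args) {
        int[] table = identityTable();
        int oe = "\u0153".codePointAt(0); // œ, decimal 339

        try {
            translateUnchecked(table, oe);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("unchecked lookup failed for code " + oe);
        }

        System.out.println("guarded lookup returns " + translate(table, oe));
    }
}
```

Running `main` prints that the unchecked lookup failed for code 339 and that the guarded lookup returns 339. Passing `null` as the table would likewise leave every code untouched, which is what UTF-8 pattern files need.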

thvitt commented 7 years ago

Fixed in 817586b