KonradHoeffner closed this 10 years ago
This is semi-solved, so feel free to reopen if necessary.
The parser grammars have been updated to handle umlauts, and the Java parsers have been regenerated from them with JavaCC.
TOKEN: {<WORD: (["a"-"z"]|["ä"]|["ö"]|["ü"]|["ß"]|["Ä"]|["Ö"]|["Ü"]|["À"]|["à"]|["Â"]|["â"]|["Æ"]|["æ"]|["Ç"]|["ç"]|["È"]|["è"]|["É"]|["é"]|["Ê"]|["ê"]|["Ë"]|["ë"]|["Î"]|["î"]|["Ï"]|["ï"]|["Ô"]|["ô"]|["Œ"]|["œ"]|["Ù"]|["ù"]|["Û"]|["û"]|["Ÿ"]|["ÿ"]|["0"-"9"]|["?"]|["-"]|["_"]|["!"]|[","]|[";"]|["."]|[":"]|["/"])+>}
However, this doesn't seem to work, so Christina added a workaround: the parser now receives normalized input without special characters, while the full, unnormalized text still goes to the NER and tagger beforehand.
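For anyone picking this up later, here is a rough sketch of what such a normalization step could look like. It is not Christina's actual code: the class name `InputNormalizer`, the method `normalizeForParser`, and the choice to transliterate German umlauts (ä to ae and so on) before stripping the remaining accents are all assumptions made for illustration. The special characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;

// Hypothetical helper, not the project's real workaround.
class InputNormalizer {

    // Returns a copy of the text without umlauts, ligatures or accents,
    // intended only for the parser. The NER and tagger would still get
    // the original text.
    static String normalizeForParser(String text) {
        // Compose first, so "u" + combining diaeresis is handled like a precomposed umlaut.
        String s = Normalizer.normalize(text, Normalizer.Form.NFC);

        // Assumed convention: German-style transliteration for umlauts and sharp s.
        s = s.replace("\u00e4", "ae").replace("\u00f6", "oe").replace("\u00fc", "ue")
             .replace("\u00c4", "Ae").replace("\u00d6", "Oe").replace("\u00dc", "Ue")
             .replace("\u00df", "ss");

        // Ligatures have no canonical decomposition, so map them explicitly.
        s = s.replace("\u00c6", "AE").replace("\u00e6", "ae")
             .replace("\u0152", "OE").replace("\u0153", "oe");

        // Decompose the remaining accented letters and drop the combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD);
        return s.replaceAll("\\p{M}+", "");
    }
}
```

With this sketch, `InputNormalizer.normalizeForParser(question)` would be called only on the string handed to the parser. Note that transliterating ü to ue changes the word length, so any character offsets from the NER or tagger would no longer line up with the normalized string.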