cltl / FormatConversions

Several conversions between formats that are commonly used by our tools
Apache License 2.0

mmax2raw.py ignores sentence boundaries #2

Open MPvHarmelen opened 5 years ago

MPvHarmelen commented 5 years ago

Alpino expects tokenized input to have every sentence on a separate line (Alpino User Guide, Section 2.5, page 7). mmax2raw.py, however, ignores sentence boundaries and writes all words of an MMAX file to a single line of the output file:

https://github.com/cltl/FormatConversions/blob/6810be2584b193fbf6624850dfe90371f79e1649/mmax2conll/mmax2raw.py#L81

Fixing this could improve tagging and therefore coreference results. Alpino also sometimes seems to output cyclic graphs as dependency "trees"; that may be caused by this issue as well.
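
A rough sketch of what the fix could look like, assuming the words are already available grouped per sentence as a list of lists of strings. That data shape and both function names are made up for illustration; they are not taken from mmax2raw.py, which may organise its data differently.

```python
def words_to_single_line(sentences):
    """Roughly the behaviour described above: every word on one line."""
    return " ".join(word for words in sentences for word in words) + "\n"


def sentences_to_lines(sentences):
    """One tokenized sentence per line, tokens separated by single spaces.

    `sentences` is a hypothetical list of lists of word strings.
    Empty sentences are skipped so no blank lines end up in the output.
    """
    return "".join(" ".join(words) + "\n" for words in sentences if words)


if __name__ == "__main__":
    example = [["Dit", "is", "een", "zin", "."], ["Dit", "ook", "."]]
    print(words_to_single_line(example), end="")
    print(sentences_to_lines(example), end="")
```

The only difference is where the newline goes: once per file in the first function, once per sentence in the second.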

MPvHarmelen commented 5 years ago

Maybe this is done on purpose, to have a more realistic assessment of the performance.