api: findWord doesn't handle enclitics/proclitics which don't use "-"

alpheios-project / arethusa

Arethusa: Annotation Environment

http://sosol.perseids.org/tools/arethusa

MIT License


api: findWord doesn't handle enclitics/proclitics which don't use "-" #800

Closed: balmas closed this issue 4 years ago

balmas commented 4 years ago

It seems that some treebank data files (perhaps only older ones that predate the Perseids tokenization workflow) don't mark enclitics/proclitics with a hyphen. For example, in sentences 2 and 4 of Ovid's Metamorphoses, "primaque" and "congestaque" have been split into "que" "prima" and "que" "congesta", with no hyphen on "que". We could deal with this by testing for specific known enclitic words, or by simply looking for merged words without hyphens.
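To make the two options concrete, here is a minimal sketch of how such a lookup might work. It is not Arethusa's actual findWord implementation: the names `KNOWN_ENCLITICS`, `stripHyphen` and `findWordIndices`, the `knownOnly` flag, and the plain array-of-strings token format are all assumptions made for illustration, and the enclitic list is only a small sample.

```javascript
// Hypothetical sample of Latin enclitics; a real list would likely be longer
// and language-specific.
const KNOWN_ENCLITICS = ['que', 'ne', 've'];

// Drop a leading or trailing hyphen so that "-que" and "que" compare equal.
function stripHyphen(form) {
  return form.replace(/^-|-$/g, '');
}

// Return the indices of the token(s) that make up `word`, or [] if not found.
// `tokens` is an array of surface forms in file order (assumed format).
// With knownOnly = true, a split word matches only if one part is a known
// enclitic; with knownOnly = false, any adjacent pair that joins to the word
// is accepted.
function findWordIndices(tokens, word, knownOnly = true) {
  const target = word.toLowerCase();
  const forms = tokens.map((t) => stripHyphen(t.toLowerCase()));

  // Case 1: the word exists as a single token.
  const single = forms.indexOf(target);
  if (single !== -1) return [single];

  // Case 2: the word was split into two adjacent tokens. Older files may put
  // the enclitic first ("que", "prima"), so try both joining orders.
  for (let i = 0; i < forms.length - 1; i++) {
    const a = forms[i];
    const b = forms[i + 1];
    const joins = a + b === target || b + a === target;
    const hasClitic =
      KNOWN_ENCLITICS.includes(a) || KNOWN_ENCLITICS.includes(b);
    if (joins && (!knownOnly || hasClitic)) return [i, i + 1];
  }
  return [];
}

// Illustrative token list, with the enclitic unhyphenated and placed first.
console.log(findWordIndices(['que', 'prima', 'ab', 'origine'], 'primaque'));
// prints [ 0, 1 ]
```

The `knownOnly` flag corresponds to the choice above: the stricter mode avoids accidental matches between unrelated neighbouring tokens, while the looser mode needs no per-language list.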

balmas commented 4 years ago

I still need to create a test page for this.

balmas commented 4 years ago

This was verified with Arethusa unit tests.