alpheios-project / tokenizer

Alpheios Tokenizer Service

Handle Latin -ne and -ve enclitics #7

Open balmas opened 4 years ago

balmas commented 4 years ago

The llt-tokenizer (https://github.com/perseids-project/llt-tokenizer) used the prometheus Latin stems database (https://github.com/perseids-project/llt-db_handler) together with morphological rules to determine when -ne and -ve were enclitics rather than part of the word itself.

If we want to handle this in the spaCy tokenizer, the right approach would probably be to add training data for a Latin model and then retokenize, using the morph data to fix the instances that need fixing.
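
As a rough illustration of the retokenization idea (not the llt-tokenizer logic, and not spaCy's API), a minimal Python sketch might split -ne/-ve only when the remaining base is a known word form. The `KNOWN_FORMS` set, the `split_enclitic` and `retokenize` helpers, and the `-ne`/`-ve` output convention are all hypothetical stand-ins for real morphological data and whatever token format the service would use.

```python
# Hypothetical stand-in for morphological data; a real system would
# consult a stem database or a trained model instead.
KNOWN_FORMS = {"potest", "plus", "minus", "vides"}

ENCLITICS = ("ne", "ve")


def split_enclitic(token, known_forms=KNOWN_FORMS):
    """Return [base, "-ne"/"-ve"] if the base is a known form, else [token]."""
    lower = token.lower()
    # A token that is itself a known form is never split.
    if lower in known_forms:
        return [token]
    for enclitic in ENCLITICS:
        base = token[: -len(enclitic)]
        if lower.endswith(enclitic) and base.lower() in known_forms:
            return [base, "-" + enclitic]
    return [token]


def retokenize(tokens, known_forms=KNOWN_FORMS):
    """Apply split_enclitic to every token and flatten the result."""
    out = []
    for tok in tokens:
        out.extend(split_enclitic(tok, known_forms))
    return out
```

With this toy lexicon, `potestne` would be split because `potest` is known, while `bene` would be left alone because `be` is not.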

For now, I've just implemented the simpler regex-based handling from llt-tokenizer for the -que enclitic.
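
For reference, regex-based handling of -que could look roughly like the sketch below. This is an assumption about the general shape, not the code in this repo or in llt-tokenizer: the `QUE_EXCEPTIONS` list is a small illustrative sample, and `split_que` is a hypothetical helper name.

```python
import re

# Words ending in "que" that should not be split. This is a small
# illustrative sample, not the exception list used by llt-tokenizer.
QUE_EXCEPTIONS = {"atque", "neque", "quoque", "usque", "denique", "quisque"}

# At least one word character followed by a final "que".
QUE_PATTERN = re.compile(r"^(\w+)(que)$", re.IGNORECASE)


def split_que(token):
    """Split a trailing -que off a token unless it is a listed exception."""
    if token.lower() in QUE_EXCEPTIONS:
        return [token]
    match = QUE_PATTERN.match(token)
    if match:
        return [match.group(1), "-" + match.group(2)]
    return [token]
```

The exception list is what keeps this approach workable for -que; -ne and -ve occur as ordinary word endings far more often, which is why they need morphological information instead.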