chrisdrymon / angel

An Ancient Greek Morphology Tagger
https://pypi.org/project/angel-tag/
MIT License

tokens beginning with parentheses are treated entirely as punctuation #3

Closed by jtauber 3 years ago

jtauber commented 3 years ago

e.g. from John 1.38 in MorphGNT SBLGNT I get:

('(ὃ', 'u--------')

chrisdrymon commented 3 years ago

Oh that's a great catch! I wondered how it tagged some punctuation wrong in the confusion matrix. It'll be fixed in the next commit.

jcuenod commented 3 years ago

Maybe allow the use of a custom tokenizer? I found the same thing in Shepherd of Hermas (27.3.1). There are also colons in there, but they're in the Latin sections.

jtauber commented 3 years ago

Personally, I would just preprocess. In many cases the text will be in XML or some other format anyway, so it will require preprocessing regardless. My run on John's Gospel involved preprocessing too (although in that case it meant concatenating an existing tokenization into a single string for the book).
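
For illustration, the preprocessing approach might look roughly like the sketch below. It pads punctuation with spaces so that a whitespace-based tokenizer sees '(' and 'ὃ' as separate tokens. The function name `separate_punctuation` and the `PUNCT` character set are assumptions made for this example, not part of angel.

```python
import re

# Characters to split off as standalone tokens. This set is an assumption
# for the example; adjust it to the punctuation used in your edition.
PUNCT = "()[].,;·"

_PUNCT_RE = re.compile("([" + re.escape(PUNCT) + "])")


def separate_punctuation(text: str) -> str:
    """Pad each punctuation mark with spaces, then collapse whitespace."""
    spaced = _PUNCT_RE.sub(r" \1 ", text)
    return " ".join(spaced.split())


# Existing tokens can first be joined into one string, then cleaned.
tokens = ["(ὃ", "λέγεται"]
cleaned = separate_punctuation(" ".join(tokens))
print(cleaned)  # ( ὃ λέγεται
```

The cleaned string can then be passed to the tagger in place of the raw text.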

jcuenod commented 3 years ago

That's fair, although using custom tokenizers seems pretty common practice in ML.
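
As a rough sketch of what a pluggable tokenizer could look like: the tagging function takes the tokenizer as a callable, so callers can swap in their own. Everything here (`tag_text`, `whitespace_tokenizer`, `paren_aware_tokenizer`, `dummy_tagger`) is hypothetical and only stands in for the idea; it is not angel's API or its actual tokenizer.

```python
from typing import Callable, List, Tuple

Tokenizer = Callable[[str], List[str]]
TokenTagger = Callable[[str], str]


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical default: split on whitespace only, so '(ὃ' stays whole."""
    return text.split()


def paren_aware_tokenizer(text: str) -> List[str]:
    """Hypothetical custom tokenizer: parentheses become their own tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def dummy_tagger(token: str) -> str:
    # Stand-in for a real model: it only recognises parentheses.
    return "u--------" if token in {"(", ")"} else "?"


def tag_text(
    text: str,
    tag_token: TokenTagger,
    tokenizer: Tokenizer = whitespace_tokenizer,
) -> List[Tuple[str, str]]:
    """Tokenize with the supplied callable, then tag each token."""
    return [(tok, tag_token(tok)) for tok in tokenizer(text)]


print(tag_text("(ὃ λέγεται)", dummy_tagger, tokenizer=paren_aware_tokenizer))
```

The trade-off is the one raised in this thread: a tokenizer hook keeps everything in one call, while preprocessing keeps the tagger's interface smaller.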

jtauber commented 3 years ago

Yes, but that's in large part because they're dealing with much more text and aren't as interested in spending a lot of time on any one text (unlike us :-))