CopticScriptorium / coptic-nlp

Coptic NLP pipeline page and utilities
Apache License 2.0

NLP having trouble with thetas #15

Open ctschroeder opened 5 years ago

ctschroeder commented 5 years ago

The NLP pipeline is having trouble with thetas. This will produce a lot of errors across any corpus that's only machine-processed or even machine-processed after segmentation has been checked. Three examples from Johannes canons docs, which show machine processing in the spreadsheet after segmentation was manually corrected in the XML window:

In this image, check out 1) athēt in the morph layer and 2) nthe in the norm + norm_group (the h is missing), lemma, and pos layers.

[Screenshot: Screen Shot 2019-03-11 at 4.54.58 PM]

In this image, see also the missing h in norm + norm_group, and thus the wrong lemma and pos tags.

[Screenshot: Screen Shot 2019-03-11 at 4.54.13 PM]

Thanks!!

amir-zeldes commented 5 years ago

This is interesting. I tested thetas and looked at our machine-processed data, and there all is well. The problem seems to happen when you segment by hand (add pipes) and then run the NLP 'from pipes'. I can reproduce it for this input:

ⲛ|ⲑ|ⲉ_ⲉⲧ|ⲥⲏϩ
(from_pipes=True)

Does your input come from the same setup? How did you pipe, or expect to pipe, this case? Is what I did above a reasonable way for the input to look?

ctschroeder commented 5 years ago

Sorry I did not reply earlier. I lost track of some gh threads. YES, this is happening when segmented from pipes. I have been running tokenization/segmentation only first, hand reviewing, then running through the NLP. ⲛ|ⲑ|ⲉ_ⲉⲧ|ⲥⲏϩ is what the segmentation is supposed to be, with the normalization being ⲛ|ⲧ|ϩⲉ_ⲉⲧ|ⲥⲏϩ.

amir-zeldes commented 5 years ago

TL;DR: theta is fixed, but some other edge cases, especially t|i could still create errors.

Details:

OK, so I've thought about this some, and this is not trivial to truly fix. The problem is that in 'from-pipes' mode we're telling the tool to trust our segmentation, rather than work stochastically. But an input ⲛ|ⲑ|ⲉ is actually ambiguous: do we mean for the tokenizer to trust us and make three single letter norms, or do we want it to be intelligent and think it might be tao-hori?

In e93de58 it's now fixed for thetas in a kind of heuristic way: all boundaries following a theta are automatically assumed to represent an underlying tao-hori split. I think this is almost never wrong (you'd need a compound with a Greek modifier ending in theta for it to go wrong, so virtually never).
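
Concretely, the heuristic can be pictured as a one-line rewrite on the piped string before tokenization. The snippet below is only a sketch of the rule described above, not the actual code in e93de58 (which presumably lives inside the tokenizer itself):

```python
import re

# Sketch of the heuristic: a theta immediately followed by a manual boundary
# is assumed to fuse tao + hori, so the tao stays in the left segment and the
# hori becomes the first letter of the right one.
def expand_theta_before_pipe(piped):
    return re.sub(r"ⲑ\|", "ⲧ|ϩ", piped)

print(expand_theta_before_pipe("ⲛ|ⲑ|ⲉ_ⲉⲧ|ⲥⲏϩ"))  # -> ⲛ|ⲧ|ϩⲉ_ⲉⲧ|ⲥⲏϩ
```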

But the problem persists for t|i spelled as one letter (ϯ), and maybe for some other special cases, specifically tetna = tetn|na.

For now, these cases will be segmented literally in 'from pipes' mode, since I think fixing them à la theta would cause more harm than good (most cases of ϯ are first person or the verb 'give').
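
To see why a theta-style rule is risky here, consider what a blanket 'ϯ| means ⲧ|ⲓ' rewrite would do. This is purely an illustration of the trade-off, not anything in the tools, and the Coptic forms are just examples (assuming an editor would mark the hidden boundary in tioudaia as ϯ|ⲟⲩⲇⲁⲓⲁ):

```python
import re

# Hypothetical blanket rule for ϯ, analogous to the theta heuristic above.
def rewrite_ti(piped):
    return re.sub(r"ϯ\|", "ⲧ|ⲓ", piped)

print(rewrite_ti("ϯ|ⲟⲩⲇⲁⲓⲁ"))  # helps the rare article case: ⲧ|ⲓⲟⲩⲇⲁⲓⲁ
print(rewrite_ti("ϯ|ⲥϩⲁⲓ"))    # but corrupts the much more common 1st person reading: ⲧ|ⲓⲥϩⲁⲓ
```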

Also note this will not affect our automatically processed corpora, since there the tokenizer makes stochastic decisions anyway and should be mostly right; the issue only arises in the deterministic mode, where we tell it 'trust my pipes'. Making the tokenizer trust them selectively is a possible future solution, but it goes substantially outside the current architecture.
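
As a rough picture of the two modes being contrasted here (a toy sketch, not the pipeline's actual code):

```python
def segment_from_pipes(piped):
    # Deterministic 'from pipes' mode: the hand segmentation is taken at face value.
    return piped.split("|")

def segment_automatic(group, tokenizer):
    # Stochastic mode used for automatically processed corpora: a trained
    # tokenizer (passed in here as a callable) decides where the boundaries go,
    # so letters like theta are resolved by the model rather than literally.
    return tokenizer(group)

print(segment_from_pipes("ⲛ|ⲑ|ⲉ"))  # ['ⲛ', 'ⲑ', 'ⲉ'] -- three single-letter norms, taken as-is
```

'Selective trust' would then mean something in between: keep the manual boundaries, but let the model arbitrate letters like ⲑ and ϯ that can hide an underlying split.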

ctschroeder commented 5 years ago

This makes a lot of sense. Thanks. I'm trying to think about how to optimize this. Are you planning on making more changes to the tools using deep learning methods? I think it may still be more accurate to check pipes first for manually edited corpora; the thetas and tis are not that frequent. Hmmm.

amir-zeldes commented 5 years ago

The current main target for new learning approaches is normalization, but there may be other opportunities in the future as well.

Note that thetas are 99.9% fixed with this commit; it's just cases like tioudaia and tetna- that are not addressed, and those are infrequent enough that correcting pipes is definitely still the best option.

amir-zeldes commented 5 years ago

BTW, another really simple option would be to use a special notation for these cases (not a pipe).
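
For example (just one possible convention, not something the tools currently support), a dedicated marker such as % could flag 'this boundary hides a fused letter', with a small preprocessing pass expanding it before the text goes to the tokenizer:

```python
import re

# Hypothetical notation: '%' marks a boundary hidden inside a fused letter
# (theta = ⲧ + ϩ, ti = ⲧ + ⲓ), while '|' keeps its usual meaning.
FUSED = {"ⲑ": ("ⲧ", "ϩ"), "ϯ": ("ⲧ", "ⲓ")}

def expand_special_boundaries(text):
    def repl(match):
        left, right = FUSED[match.group(1)]
        return left + "|" + right
    return re.sub("([ⲑϯ])%", repl, text)

print(expand_special_boundaries("ⲛ|ⲑ%ⲉ_ⲉⲧ|ⲥⲏϩ"))  # -> ⲛ|ⲧ|ϩⲉ_ⲉⲧ|ⲥⲏϩ
print(expand_special_boundaries("ϯ%ⲟⲩⲇⲁⲓⲁ"))       # -> ⲧ|ⲓⲟⲩⲇⲁⲓⲁ
```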