ctschroeder opened this issue 5 years ago
This is interesting. I tested thetas against our machine-processed data, and there all is well. The problem seems to happen when you segment by hand (adding pipes) and then run the NLP 'from pipes'. I can reproduce it for this input:
ⲛ|ⲑ|ⲉ_ⲉⲧ|ⲥⲏϩ
(from_pipes=True)
Does your input come from the same setup? How did you pipe/expect to pipe this case? Is what I did above a logical way for the input to be expected to look?
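For readers unfamiliar with the 'from pipes' format above, the input decomposes like this — a minimal sketch, assuming `_` separates bound groups and `|` separates morphs within a group; `parse_pipes` is a hypothetical helper, not the tool's actual API:

```python
def parse_pipes(text):
    """Split a manually piped input into bound groups of morphs.

    Assumes '_' marks bound-group boundaries and '|' marks morph
    boundaries inside a group (illustrative only).
    """
    return [group.split("|") for group in text.split("_")]

# The problem input: two bound groups, the first with three
# single-letter morphs, one of which is a theta.
parse_pipes("ⲛ|ⲑ|ⲉ_ⲉⲧ|ⲥⲏϩ")
# -> [['ⲛ', 'ⲑ', 'ⲉ'], ['ⲉⲧ', 'ⲥⲏϩ']]
```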
Sorry I did not reply earlier; I lost track of some GitHub threads. Yes, this is happening when segmenting from pipes. I have been running tokenization/segmentation only first, hand-reviewing, then running through the NLP. ⲛ|ⲑ|ⲉ_ⲉⲧ|ⲥⲏϩ is what the segmentation is supposed to be, with the normalization being ⲛ|ⲧ|ϩⲉ_ⲉⲧ|ⲥⲏϩ
TL;DR: theta is fixed, but some other edge cases, especially t|i, could still create errors.
Details:
OK, so I've thought about this some, and it is not trivial to truly fix. The problem is that in 'from pipes' mode we're telling the tool to trust our segmentation, rather than work stochastically. But an input like ⲛ|ⲑ|ⲉ is actually ambiguous: do we mean for the tokenizer to trust us and produce three single-letter norms, or do we want it to be intelligent and suspect the theta might be an underlying tao-hori?
In e93de58 this is now fixed for thetas in a heuristic way: any boundary following a theta is automatically assumed to be an underlying tao-hori split. I think this is almost never wrong (you'd need a compound with a Greek modifier ending in theta for it to misfire, so virtually never).
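The heuristic amounts to roughly the following — a sketch, not the code in the commit; the function name and the plain string rewrite are assumptions:

```python
def normalize_theta(piped):
    """Theta heuristic sketch: a morph boundary immediately after ⲑ
    is assumed to split an underlying ⲧϩ (tao-hori), so ⲑ| is
    rewritten as ⲧ|ϩ in the normalized form. Hypothetical helper,
    not the actual implementation."""
    return piped.replace("ⲑ|", "ⲧ|ϩ")

normalize_theta("ⲛ|ⲑ|ⲉ_ⲉⲧ|ⲥⲏϩ")
# -> "ⲛ|ⲧ|ϩⲉ_ⲉⲧ|ⲥⲏϩ"  (the expected normalization above)
```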
But the problem persists for t|i spelled as one letter (ϯ), and possibly some other special cases, specifically tetna = tetn|na. For now, these cases will be segmented literally in 'from pipes' mode, since I think fixing them à la theta would cause more harm than good (most cases of ϯ are the first person or the verb 'give').
Also note this will not affect our automatically processed corpora, since there the tokenizer makes stochastic decisions anyway and should be mostly right; the issue only arises in the deterministic mode, where we tell it 'trust my pipes'. Making the tokenizer trust them selectively is a possible future solution, but it goes substantially beyond the current architecture.
This makes a lot of sense, thanks. I'm trying to think about how to optimize this. Are you planning to make more changes to the tools using deep learning methods? I think it may still be more accurate to check pipes first for manually edited corpora; thetas and ϯs are not that frequent. Hmmm.
The current main target for new learning approaches is normalization, but there may be other opportunities in the future as well.
Note that thetas are 99.9% fixed with this commit; it's just cases like tioudaia and tetna- that are not addressed, and those are infrequent enough that correcting pipes is definitely still the best option.
BTW, another really simple option would be to use a special notation for these cases (not a pipe).
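One way such a notation could look — purely a sketch; the `~` marker, the mapping table, and the function are all made up here for illustration — is a distinct character that always means "boundary inside this single letter", leaving `|` unambiguous:

```python
# Hypothetical "split inside this letter" notation: ⲑ~ⲉ always means
# underlying ⲧ|ϩⲉ, while ⲑ|ⲉ stays a literal split. The '~' marker
# and this mapping are illustrative, not an existing convention.
SPLITS = {"ⲑ": "ⲧ|ϩ", "ϯ": "ⲧ|ⲓ"}

def resolve(piped):
    for letter, split in SPLITS.items():
        piped = piped.replace(letter + "~", split)
    return piped

resolve("ⲛ|ⲑ~ⲉ_ⲉⲧ|ⲥⲏϩ")
# -> "ⲛ|ⲧ|ϩⲉ_ⲉⲧ|ⲥⲏϩ"
```

With this scheme, a plain pipe after ⲑ or ϯ would again be taken literally, so no heuristic guessing is needed in 'from pipes' mode.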
The NLP pipeline is having trouble with thetas. This will produce a lot of errors across any corpus that's only machine-processed, or even machine-processed after segmentation has been checked. Three examples from the Johannes canons docs, which show machine processing in the spreadsheet after segmentation was manually corrected in the XML window:
In this image, check out (1) athēt in the morph layer and (2) nthe in the norm and norm_group (the h is missing), lemma, and pos layers.
In this image, see also the missing h in norm and norm_group, and thus the wrong lemma and pos tags.
Thanks!!