KELLIA / dictionary

The dictionary comprises the Coptic lexicon created by the BBAW and an interface by Coptic SCRIPTORIUM. Currently deployed at https://coptic-dictionary.org

ANNIS link to "lemma" or rather to "norm"? #65

Open dwerning opened 5 years ago

dwerning commented 5 years ago

(1)

@phoenix-mossimo and I noticed that, as it is now, all ANNIS links search for "lemma" in ANNIS, based on the bare form string: TLA "form" => ANNIS "lemma". Wouldn't it make more sense to search ANNIS for "norm", which seems to correspond to "form" in the BBAW/DDGLC lexicon: TLA "form" => ANNIS "norm"? (If that makes sense, could the link look for [TLA form => ANNIS norm] and [TLA lemma => ANNIS lemma] at the same time?)

To add an ANNIS link for ANNIS "lemma" we should have a link from the TLA "lemma" form (Cxxxx). However, there is currently no place for this ANNIS "lemma" link icon. I suggest adding it in a lemma section in the light blue box, before the form list, e.g.

Lemma | TLA lemma ID | ANNIS
ⲕⲁⲧⲁ | C9360 | [ANNIS "lemma" links]

Form | Dial. | TLA form ID | POS | ANNIS
ⲕⲁⲧⲁ | L | CF23566 | Präp. | [ANNIS "norm" links]
ⲕⲁⲧⲁ | S | CF23567 | Präp. | [ANNIS "norm" links]
...

PS: I am aware that the lemma information then appears three times on the page (title, box, citation). But that makes perfect sense to me (the same holds for articles: the topic appears in the title, the text (here: the box), and the citation of the article).

(2)

In ANNIS, the links back to CDO may start not only from "lemma" forms (e.g. ⲉⲣⲉ , ⲛⲧⲟϥ , ⲕⲱⲱⲥ) in the annotation view, but also from "norm" forms (e.g. ⲉ , ϥ , ⲕⲟⲟⲛⲥ).

amir-zeldes commented 5 years ago

This is a complex issue - there are already several kinds of searches from CDO to ANNIS and back:

  1. By default, items link to lemma="xyz" since we assume most entries reflect uninflected items
  2. Items with oRef link to norm="x" . norm="y" because we assume they reflect potentially inflected items, though this may not be true
  3. Items with whitespace in the entry but no oRef link to norm_group=/.*xyz.*/ in the hope that a query to a complex item with unclear division might match a bound group
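The three CDO-to-ANNIS cases above can be sketched as a small dispatch function. This is only an illustration under stated assumptions, not the actual CDO code: the function name `build_aql`, the representation of oRef parts as a list of strings, and the stripping of whitespace before building the norm_group pattern are all hypothetical.

```python
from typing import Optional, Sequence


def build_aql(entry: str, oref_parts: Optional[Sequence[str]] = None) -> str:
    """Pick an ANNIS query for a dictionary entry (hypothetical helper).

    entry      -- the entry string as shown in the dictionary
    oref_parts -- the parts marked by oRef, if any (assumed to arrive as a list)
    """
    # Case 2: entries with oRef become a sequence of adjacent norm tokens.
    if oref_parts:
        return " . ".join(f'norm="{part}"' for part in oref_parts)

    # Case 3: whitespace but no oRef, so try to match inside a bound group.
    # Assumption: whitespace is dropped before building the regex, and the
    # entry contains no regex metacharacters.
    if any(ch.isspace() for ch in entry):
        joined = "".join(entry.split())
        return f"norm_group=/.*{joined}.*/"

    # Case 1 (default): treat the entry as an uninflected lemma.
    return f'lemma="{entry}"'


if __name__ == "__main__":
    print(build_aql("ⲕⲱⲱⲥ"))
    print(build_aql("x y", ["x", "y"]))
    print(build_aql("x y"))
```

The order of the checks matters in this sketch: oRef is tested first, so an entry that has both whitespace and oRef falls under case 2, which matches the wording of the list.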

All of these are heuristics meant to catch 'what the user might mean', but they are by no means perfect. Conversely, ANNIS links to CDO in 3 ways at present:

  1. lemmas link to a search for the lemma
  2. multiword items in the ANNIS corpora link to a CDO search for the items in sequence, where the annotated string in the ANNIS multiword annotation is guaranteed to match the spelling of CDO entries (the mwe recognizer uses the CDO xml file)
  3. morphologically complex items (ANNIS morph) link to the respective sub-part entries (e.g. mnt-, etc.)

This too is not exhaustive. I think there is room to discuss more how we could handle things, but the most pressing of these I would say is making oRef a reliable indicator of entries' internal structure, and promoting parity between our tokenization models. I wouldn't mind seeing disjunction queries for lemmas and norm, but I hope this won't require another row for the lemma, since the ANNIS search is a nice bonus functionality, while I think keeping the entry clear and readable is a main objective of the entry page.

dwerning commented 5 years ago

There seems to be a partial misunderstanding. From the perspective of the TLA/DDGLC lexicon, the difference between lemma and form is not (primarily) one of "inflection" (person/number/gender, as I understand it) but of form/spelling, e.g.

lemma: "ⲉⲣⲉ" ; forms: ⲉⲣⲉ , ⲉ
lemma: "ⲕⲱⲱⲥ" ; forms: ⲕⲱⲱⲥ , ⲕⲱⲱⲥⲉ , ⲕⲱⲛⲥ , ⲕⲟⲟⲛⲥ

The CDO mentions TLA "forms", but links to ANNIS "lemma", not to "norm" (= "form"). The issue seems quite tricky, however, since TLA seems to analyze as different lemmas what ANNIS takes as different forms of a single lemma (e.g., ⲉⲣⲉ vs. ⲉ), and the other way around: ANNIS seems to analyze as different lemmas what TLA takes as different forms of a single lemma (e.g., ⲕⲱⲱⲥ vs. ⲕⲱⲛⲥ). I still think this occasional mismatch, evidently on both sides, is best resolved by linking TLA forms to ANNIS norms: following the ANNIS link, you then really get the statistics for this form (not the lemma).

phoenix-mossimo commented 5 years ago

@amir-zeldes: how about extending the regular query that is sent to ANNIS with "norm"? I.e.:

lemma="ⲕⲱⲱⲥ" | norm="ⲕⲱⲱⲥ"

This would cover the possible discrepancies between Coptic Scriptorium and CDO in the perception of the "standard" lemma form.
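As a rough sketch of this proposal, the same string can be placed in both halves of the disjunction. The helper name `lemma_or_norm` is made up for illustration, and how CDO would embed the resulting query in an ANNIS link is left open here.

```python
def lemma_or_norm(form: str) -> str:
    """Build a query matching tokens whose lemma OR norm equals `form`.

    Hypothetical helper: it only assembles the query string and does not
    contact ANNIS.
    """
    return f'lemma="{form}" | norm="{form}"'


if __name__ == "__main__":
    # Forms listed earlier in this thread for the TLA lemma ⲕⲱⲱⲥ.
    for form in ["ⲕⲱⲱⲥ", "ⲕⲱⲱⲥⲉ", "ⲕⲱⲛⲥ", "ⲕⲟⲟⲛⲥ"]:
        print(lemma_or_norm(form))
```

Each form gets its own query in this sketch, so nothing has to be known about how the other side groups forms under lemmas.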

dwerning commented 5 years ago

But if you have a) ⲉⲣⲉ and b) ⲉ and you query

a) lemma="ⲉⲣⲉ" | norm="ⲉⲣⲉ"
b) lemma="ⲉ" | norm="ⲉ"

you would get no results in ANNIS for b). Remember that the TLA lexicon does not connect the two, so a query

lemma="ⲉⲣⲉ" | norm="ⲉ"

cannot be computed from the TLA lexicon alone. And the other way around, with the info from the TLA lexicon a query

lemma="ⲕⲱⲱⲥ" | norm="ⲕⲱⲛⲥ"

would be possible; however, it would give no results in ANNIS since, this time, ANNIS does not connect the two. The issue is obviously complicated.

amir-zeldes commented 5 years ago

I think this is a misunderstanding - | means "OR" not "AND", so it would not create 0 results. The risk is rather over-generation (i.e. including irrelevant results).
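To make the OR semantics concrete, here is a minimal sketch over a toy token list. The tokens and their annotations are invented for illustration and are not actual Scriptorium data; `Token`, `match_lemma` and `match_lemma_or_norm` are hypothetical names, and plain Python filtering stands in for what ANNIS would do.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    norm: str
    lemma: str


# Invented toy annotations (assumption: ⲉ is lemmatized as ⲉⲣⲉ here).
TOY_CORPUS = [
    Token(norm="ⲉⲣⲉ", lemma="ⲉⲣⲉ"),
    Token(norm="ⲉ", lemma="ⲉⲣⲉ"),
    Token(norm="ⲕⲟⲟⲛⲥ", lemma="ⲕⲱⲱⲥ"),
]


def match_lemma(corpus: List[Token], form: str) -> List[Token]:
    """Tokens matched by lemma="form" alone."""
    return [t for t in corpus if t.lemma == form]


def match_lemma_or_norm(corpus: List[Token], form: str) -> List[Token]:
    """Tokens matched by lemma="form" | norm="form" (a union of both hits)."""
    return [t for t in corpus if t.lemma == form or t.norm == form]


if __name__ == "__main__":
    for form in ["ⲉⲣⲉ", "ⲉ", "ⲕⲟⲟⲛⲥ"]:
        only_lemma = match_lemma(TOY_CORPUS, form)
        either = match_lemma_or_norm(TOY_CORPUS, form)
        print(form, len(only_lemma), len(either))
```

Because the disjunction is a union, it can never return fewer tokens than the lemma-only query. In this toy data the query for ⲉ finds nothing by lemma but one token by norm, and the extra hits are also where over-generation can creep in.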

As for the examples above, I don't think we have different conventions for verb lemmas, i.e. ⲕⲱⲱⲥ vs. ⲕⲱⲛⲥ should be the same lemma (if these are inflected forms of the same verb). ANNIS itself doesn't do any lemmatization; it just offers search over Scriptorium data that was automatically lemmatized and, in most cases, manually inspected (or, in the best case of treebanked data, completely manually revised). The metadata fields to look for are 'parsing', 'tagging' and 'segmentation', which have the values 'automatic', 'checked' and 'gold'. But errors inevitably occur, due to either NLP errors or human error.

As for ere/e, I think they should have the same lemma: by Scriptorium guidelines, the lemma should be the most independent form possible (so prenominal for converters or prepositions, absolute form for verbs or nouns, independent form for pronouns, etc.) - see our guidelines here:

https://github.com/CopticScriptorium/tagger-part-of-speech/blob/master/Coptic%20SCRIPTORIUM%20lemmatization%20guidelines.pdf

dwerning commented 5 years ago

Yes, a misunderstanding. I wasn't aware that "|" is the common operator ;) And I agree, the lemma="XX" | norm="XX" solution could give additional irrelevant results. However, lemma="XX" alone would give (and does give) no results in some cases of form variants and dialectal variants. Neither option is fully satisfying. However, I would personally go for your suggestion lemma="XX" | norm="XX". (In the end, I would like the ANNIS/CS team to decide.)