KELLIA / dictionary

The dictionary comprises the Coptic lexicon created by the BBAW and an interface by Coptic SCRIPTORIUM. Currently deployed at https://coptic-dictionary.org

"Search in Annis" of compounds written separately delivers nothing #26

Open phoenix-mossimo opened 7 years ago

phoenix-mossimo commented 7 years ago

The dictionary contains a substantial number of compounds that are written separately. Searching for these in Annis (button "Search in Annis") delivers nothing. I guess such cases need special treatment by the search query script.

The issue is also relevant to "multiwords", which, although written together, contain lexical items and have additional tagging in the dictionary. For the list of multiword types see #27.

Types of compounds that are written separately:

1) Verb (st. abs.) + ⲛ/ⲙ/ⲉ/ⲉⲛ + article + noun (non-possessed): e.g.

ϯ ⲙⲡⲓⲙⲱⲓⲧ “give way”
ϯ ⲛⲟⲩϭⲓⲙϣⲓϣ “take vengeance”
ϯ ⲉⲡⲟⲩϣⲁⲡ “lend”
ϩⲉ ⲉⲡⲟⲩⲟⲉⲓϣ “find time”
ϯ ⲉⲡⲥⲱⲧⲉ “pay ransom”
etc.

2) Verb (st. abs.) + ⲛ/ⲙ/ⲉ/ⲉⲛ + Ø + noun (non-possessed): e.g.

ϯ ⲛⲉⲩⲱ “give as pledge”
ϫⲓ ⲛϭⲟⲛⲥ “use violence, do evil”
ⲉⲣ ⲛⲁⲧⲑⲱⲧ ⲛϩⲏⲧ “disagree”
ϫⲓ ⲉⲃⲉⲕⲉ “hire”
ϭⲓ ⲛⲥⲕⲉⲛϩⲟ “look good”

3) Verb (st. abs.) + preposition / adverb: e.g.

ϯ ⲉϩⲟⲩⲛ (ⲉϩⲣⲛ-) “oppose”
ϥⲓ ⲙⲛ “agree with”
ⲟⲩⲱⲧⲉⲃ ⲥⲁⲃⲟⲗ “step over”
ⲛⲟⲩ ϩⲛⲧⲟⲩⲱ “sit (to eat)”

4) Verb (st. abs.) + Ø + Ø + noun (non-possessed): e.g.

ⲃⲱⲗ ϣⲧⲱⲣⲉ ⲉⲃⲟⲗ “dissolve a guarantee”

5) Verb (st. abs.) + Ø + possessive pronoun + noun (non-possessed): e.g.

ⲥⲓⲧⲉ ⲛⲉϥⲟⲩⲉⲗⲗⲉ “recite one's poetry”

6) Verb (st. abs.) + ⲛ/ⲙ/ⲉ/ⲉⲛ + Ø + noun (possessed) (+ suffix): e.g.

ϭⲛⲟⲛ ⲛϫⲱ⸗ “obey”
ϯ ⲛⲓⲁⲧ⸗ “observe”
ⲱϩⲉ ⲉⲣⲁⲧ⸗ “stand on foot”
ϯ ⲉⲣⲁⲧ⸗ “put on foot”
ϯ ⲛⲧⲟⲟⲧ⸗ “give a hand, help”

phoenix-mossimo commented 7 years ago

Are you sure all of those are bound groups (ex. §3)? They surely are when used as nouns, but as verbs? I tried "norm_group=/.ϫⲓⲛϭⲟⲛⲥ./" vs. lemma="ϭⲟⲛⲥ", and many instances of "ϫⲓ ⲛϭⲟⲛⲥ" are not found, also because of inconsistent encoding. Let me check the conditions first.

amir-zeldes commented 7 years ago

No, you're right, they are not necessarily bound groups; the solution to search in bound groups is really just a 'band aid' - it may work sometimes and is better than nothing, but is not an absolute solution. The real solution is to specify the sequence of norms in oRef - that's what @mjabrams is working on. Then it won't matter if they're in the same bound group, because ANNIS will search for a sequence of norms regardless of bound group borders.
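A minimal sketch of the kind of query this describes, assuming the parts of an entry are already available as a list of norm values. It links consecutive `norm` nodes with AQL's direct-precedence operator (`.`). The function name `norm_sequence_query` and the segmentation of ϫⲓ ⲛϭⲟⲛⲥ used below are illustrative assumptions, not the dictionary's actual code.

```python
def norm_sequence_query(norms):
    """Build an AQL query that matches the given norm values as adjacent units.

    Hypothetical helper: `norms` is assumed to come from an entry's oRef parts.
    """
    if not norms:
        raise ValueError("at least one norm is required")
    # One search term per part: norm="..."
    terms = ['norm="{}"'.format(n) for n in norms]
    # Link node i to node i+1 by direct precedence: #1 . #2, #2 . #3, ...
    links = ["#{} . #{}".format(i, i + 1) for i in range(1, len(norms))]
    return " & ".join(terms + links)


# Assumed segmentation, for illustration only
print(norm_sequence_query(["ϫⲓ", "ⲛ", "ϭⲟⲛⲥ"]))
```

Because the query only constrains the order of norm units, it should not depend on whether the parts happen to share a bound group.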

phoenix-mossimo commented 7 years ago

By "specifying the sequence of norms in oRef" do you mean a) extending the XML files (for all compounds) or b) doing it on the fly while generating the query for a given compound?

We would be very interested in a); that is actually part of the plan.

amir-zeldes commented 7 years ago

I'm not sure I understand the difference, but I think a). Isn't this already what the oRef tags do?

phoenix-mossimo commented 7 years ago

"oRef" was applied to "multiwords" only, the type of compound defined in #27, but not to all compounds.

amir-zeldes commented 7 years ago

I see... Yes, it would be better to either apply it to all complex entries, or have another similar tag for non "multiword" compounds. Basically, if there is some clear way for us to figure out what to search for in ANNIS, that would be best.

But in the meantime, if an entry doesn't have oRef, but does contain spaces and is in the same bound group, the fallback of norm_group=/.XYZ./ will catch it, so we already have fairly good coverage (this is not a permanent solution, but not terrible for now IMO).
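The order of preference described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the deployed script: `build_annis_query` and its `oref_parts` argument are hypothetical names, and the `.*` around the joined form assumes the bound-group pattern is meant as a substring match.

```python
import re


def build_annis_query(form, oref_parts=None):
    """Pick a search strategy for a dictionary form (hypothetical helper)."""
    if oref_parts:
        # Preferred: explicit sequence of norms, independent of bound groups
        terms = ['norm="{}"'.format(p) for p in oref_parts]
        links = ["#{} . #{}".format(i, i + 1) for i in range(1, len(oref_parts))]
        return " & ".join(terms + links)
    if " " in form.strip():
        # Fallback: drop the spaces and look for the string inside a bound group.
        # The surrounding `.*` is an assumption about the intended substring match.
        joined = re.escape("".join(form.split()))
        return "norm_group=/.*{}.*/".format(joined)
    # Single word without oRef: plain lemma search
    return 'lemma="{}"'.format(form.strip())


print(build_annis_query("ϫⲓ ⲛϭⲟⲛⲥ"))
print(build_annis_query("ϭⲟⲛⲥ"))
```

The fallback branch only finds a compound whose parts sit in one bound group, which is why it remains a stopgap.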

dwerning commented 5 years ago

What about the state of this issue? (Milestone 2.1.0?)

phoenix-mossimo commented 5 years ago

Yes. We need to manually tag the parts of all compounds within the tag. It is a good task for a student assistant (studentische Hilfskraft) or a work contract (Werkvertrag).

amir-zeldes commented 5 years ago

That sounds good; we could then update our MWE tagger to be based on the new list of oRef elements.
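As a rough sketch of how such a list might be collected once the compounds are tagged: the element layout below (`entry`, `orth`, and one `oRef` per part) is a made-up example, and the real lexicon files may be structured differently. `collect_oref_sequences` is likewise a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry markup; the real lexicon files may be structured differently.
SAMPLE = """
<entries>
  <entry id="e1">
    <orth>ϫⲓ ⲛϭⲟⲛⲥ</orth>
    <oRef>ϫⲓ</oRef><oRef>ⲛ</oRef><oRef>ϭⲟⲛⲥ</oRef>
  </entry>
  <entry id="e2">
    <orth>ϭⲟⲛⲥ</orth>
  </entry>
</entries>
"""


def collect_oref_sequences(xml_text):
    """Map each entry's orth form to its list of oRef parts.

    Entries without any oRef element are skipped.
    """
    root = ET.fromstring(xml_text)
    sequences = {}
    for entry in root.iter("entry"):
        parts = [o.text.strip() for o in entry.iter("oRef") if o.text]
        if parts:
            sequences[entry.findtext("orth").strip()] = parts
    return sequences


print(collect_oref_sequences(SAMPLE))
```

A mapping of this shape could feed both the query generation and a multiword tagger, since each needs the same ordered list of parts.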