KorAP / KorAP-XML-TEI

Conversion of TEI P5 based formats to KorAP-XML
BSD 2-Clause "Simplified" License

Support inline dependency annotation #7

Closed (Akron closed 1 month ago)

Akron commented 2 months ago

In addition to inline token support, inline dependencies may also be supported in a limited (i.e. token-based and sentence-scoped) form. Proposal by Harald Lüngen:

<s>
 <w n="1" lemma="Fake" pos="N" head="2" deprel="name">Fake</w>
 <w n="2" lemma="News" pos="N" head="3" deprel="name">News</w>
 <w n="3" lemma="media" pos="N"  head="0" deprel="ROOT">Media</w>
...
</s>

Requested by Harald Lüngen.
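To make "token-based and sentence-scoped" concrete, here is a minimal Python sketch (not the converter's actual implementation) that reads the proposed attributes from one `<s>` and resolves each `@head` against the `@n` values of the same sentence. The function name `read_dependencies` is hypothetical, and the sample contains only the three tokens shown in the proposal, without the elided remainder.

```python
import xml.etree.ElementTree as ET

# The three tokens from the proposal; the elided "..." part is left out.
SAMPLE = """<s>
 <w n="1" lemma="Fake" pos="N" head="2" deprel="name">Fake</w>
 <w n="2" lemma="News" pos="N" head="3" deprel="name">News</w>
 <w n="3" lemma="media" pos="N" head="0" deprel="ROOT">Media</w>
</s>"""


def read_dependencies(s_xml):
    """Return (dependent, deprel, head) triples of surface forms for one <s>.

    @head is looked up among the @n values of the same sentence only.
    head="0" matches no token, so its head is reported as None.
    """
    sentence = ET.fromstring(s_xml)
    words = {w.get("n"): w for w in sentence.iter("w")}
    triples = []
    for w in words.values():
        head = words.get(w.get("head"))
        head_form = head.text if head is not None else None
        triples.append((w.text, w.get("deprel"), head_form))
    return triples


for triple in read_dependencies(SAMPLE):
    print(triple)
```

Because the lookup table is rebuilt per sentence, a `@head` value can never point outside its own `<s>`, which is the limitation the proposal accepts.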

bansp commented 2 months ago

That is more or less in line with what the LingSIG has been pondering as an addition to the shallow grammatical description tools. It's a nice nudge for the group, too -- thanks, @luengen !

Akron commented 2 months ago

@bansp That's great - can you link to a description of this specification/proposal?

luengen commented 1 month ago

At last, here is the test file. Is it ok? SKU21.head.i5.xml.zip

luengen commented 1 month ago

Note that besides @head and @deprel, another attribute called @msd is proposed, which, unlike @head and @deprel, is already in the official TEI.
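As an illustration of how such an @msd value might be consumed, the sketch below splits a value like `NUM_Sg|CASE_Nom|CASECHANGE_Up` into feature/value pairs. It assumes that features are separated by `|`, that name and value are separated by the first `_`, and that a bare `_` means "no features"; these are readings of the test data, not a documented format, and `parse_msd` is a hypothetical helper.

```python
def parse_msd(msd):
    """Split an @msd string into a {feature: value} dict.

    Assumptions (not confirmed by any specification): "|" separates
    features, the first "_" separates name from value, and a bare "_"
    or an empty string stands for "no features".
    """
    features = {}
    if msd in ("", "_"):
        return features
    for item in msd.split("|"):
        name, _, value = item.partition("_")
        features[name] = value
    return features


print(parse_msd("NUM_Sg|CASE_Nom|CASECHANGE_Up"))
```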

luengen commented 1 month ago

<s>
 <w n="1" lemma="Fake"  pos="N" head="2" deprel="name" msd="SUBCAT_Prop|CASECHANGE_Up|OTHER_UNK">Fake</w>
 <w n="2" lemma="News"  pos="N" head="3" deprel="name" msd="SUBCAT_Prop|CASECHANGE_Up|OTHER_UNK">News</w>
 <w n="3" lemma="media" pos="N" head="0" deprel="ROOT" msd="NUM_Sg|CASE_Nom|CASECHANGE_Up">Media</w>

Akron commented 1 month ago

I'll give it a try. Is the self-referentiality of, e.g.,

<p xml:lang="x-|xxx:1|">
  <s xml:lang="xxx">
    <w deprel="ROOT" head="0" lemma="uo" msd="_" n="1" pos="Num">uo</w>
  </s>
</p>

important?

luengen commented 1 month ago

How is it self referential?

Akron commented 1 month ago

Maybe I am misreading that. What should head="0" mean?

luengen commented 1 month ago

In a dependency tree, there is always one particular word at the root of the tree (usually the finite verb). This root word is, strictly speaking, not dependent on another word. In the present analyses, the root word seems to be consistently marked with head=0 and deprel=ROOT, apparently to ensure that every column has a value. I thought this was common for UD annotations in CoNLL-U.
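To show the correspondence with CoNLL-U, the following sketch renders the inline attributes as ten-column CoNLL-U rows, where the HEAD column of the root token is 0. Placing @pos in the XPOS column and leaving UPOS, FEATS, DEPS and MISC as `_` are assumptions made for this illustration; `to_conllu_rows` is a hypothetical helper, not part of conllu2korapxml.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<s>
 <w n="1" lemma="Fake" pos="N" head="2" deprel="name">Fake</w>
 <w n="2" lemma="News" pos="N" head="3" deprel="name">News</w>
 <w n="3" lemma="media" pos="N" head="0" deprel="ROOT">Media</w>
</s>"""


def to_conllu_rows(s_xml):
    """Render each <w> of one <s> as a tab-separated CoNLL-U row."""
    rows = []
    for w in ET.fromstring(s_xml).iter("w"):
        rows.append("\t".join([
            w.get("n", "_"),       # ID
            w.text or "_",         # FORM
            w.get("lemma", "_"),   # LEMMA
            "_",                   # UPOS (left empty in this sketch)
            w.get("pos", "_"),     # XPOS (assumption: @pos is corpus-specific)
            "_",                   # FEATS (left empty in this sketch)
            w.get("head", "_"),    # HEAD, where 0 marks the root
            w.get("deprel", "_"),  # DEPREL
            "_",                   # DEPS
            "_",                   # MISC
        ]))
    return rows


print("\n".join(to_conllu_rows(SAMPLE)))
```

Every row has a value in the HEAD and DEPREL columns, including the root row, which is the "every column has a value" effect described above.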

Akron commented 1 month ago

Yes, but that means it's self-referential. Or it can't be indexed (i.e. found).

Akron commented 1 month ago

If I remember correctly, there are also annotations that refer to the sentence-span for root.

luengen commented 1 month ago

Yes, that's how one could understand the head=0. Do you mean we should add n="0" to every <s> to make it explicit? That seems redundant, and is not in the source either.

Akron commented 1 month ago

No - my question is really how this should be interpreted. 0 (and with it root) could be completely ignored (simple; not queryable), 0 could refer to the same token (self-referential; simple), or 0 could refer to the embedding span (not so simple and maybe not so queryable).
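The three readings of head="0" can be put side by side in a small sketch. The mode names `"ignore"`, `"self"` and `"span"` and the `Token` type are hypothetical labels for this illustration; token positions stand in for whatever offsets an index would actually use.

```python
from typing import List, NamedTuple, Tuple, Union


class Token(NamedTuple):
    n: int        # position within the sentence, as in @n
    form: str
    head: int     # value of @head; 0 marks the root
    deprel: str


# A relation target is a single token position or a (first, last) span.
Target = Union[int, Tuple[int, int]]
Relation = Tuple[int, str, Target]


def resolve(tokens: List[Token], root_mode: str) -> List[Relation]:
    """Return (source, deprel, target) relations for one sentence."""
    if root_mode not in ("ignore", "self", "span"):
        raise ValueError(f"unknown root_mode: {root_mode}")
    sentence_span = (tokens[0].n, tokens[-1].n)
    relations: List[Relation] = []
    for t in tokens:
        if t.head != 0:
            relations.append((t.n, t.deprel, t.head))
        elif root_mode == "self":
            relations.append((t.n, t.deprel, t.n))            # points at itself
        elif root_mode == "span":
            relations.append((t.n, t.deprel, sentence_span))  # points at <s>
        # root_mode == "ignore": the root relation is dropped entirely
    return relations


TOKENS = [
    Token(1, "Fake", 2, "name"),
    Token(2, "News", 3, "name"),
    Token(3, "Media", 0, "ROOT"),
]

for mode in ("ignore", "self", "span"):
    print(mode, resolve(TOKENS, mode))
```

Only the treatment of the root token differs between the modes; all other relations are identical.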

bansp commented 1 month ago

> @bansp That's great - can you link to a description of this specification/proposal?

Nope, not yet, as far as a coherent suggestion from the group is concerned. I'll see if I can find some links to others' proposals in the meeting minutes (it's been a while since the last one).

bansp commented 1 month ago

> No - my question is really how this should be interpreted. 0 (and with it root) could be completely ignored (simple), 0 could refer to the same token (self-referential; simple), or 0 could refer to the embedding span (not so simple and maybe not so queryable).

But that's not a clear question :-) , there's at least one hidden assumption here -- interpreted by what, for what purpose? In the dependency grammar used here, each relation is labelled, and the one targeting "0" has the label "root". It is not a reflexive relation. "0" is like "/" in XPath, it's not a word, just an abstract node (just like the root node in an XML tree is not the root element -- it is the abstract parent of the root element). The label "root" basically says "start here". I'm guessing that your question asks how the KorAP engine should interpret this internally, for the purpose of indexing (correct?). I don't have a clear answer to that, because I know too little. :-\
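The XML side of this analogy can be checked with Python's standard `xml.dom.minidom`: the document node is the abstract parent of the root element and is not itself an element. The function name `describe_root` is hypothetical; the input is a reduced version of the one-token sentence quoted earlier in the thread.

```python
from xml.dom import Node, minidom


def describe_root(xml_text):
    """Return facts about the root element and its abstract parent node."""
    doc = minidom.parseString(xml_text)
    root_element = doc.documentElement
    return {
        "root_element": root_element.tagName,
        "parent_is_document": root_element.parentNode is doc,
        "document_is_element": doc.nodeType == Node.ELEMENT_NODE,
        "document_has_parent": doc.parentNode is not None,
    }


info = describe_root('<s><w n="1" head="0" deprel="ROOT">uo</w></s>')
for key, value in info.items():
    print(key, value)
```

In the same way, head="0" names a position above the tree rather than a token inside it, so the relation labelled ROOT is not reflexive.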

Akron commented 1 month ago

Not really how it should work internally but - user-centric - how the data should be queryable. There are three options possible afaik.

bansp commented 1 month ago

Naively, I'd look towards ANNIS for guidance -- at least it's tested. And it is general, also handling constituent trees. Ah, or MTAS, of course. I am not sure if Poliqarp has received some new relation-oriented extensions, but employing them would mean a separate addition to KorAP anyway, I guess.

luengen commented 1 month ago

But you have dependency annotations in KorAP already, and I would assume they have a root word with head=0 in just the same way. I would also assume that conllu2korapxml already has a particular treatment for these and that they can be queried in a certain way. Can't we derive the answer from there?

Unfortunately, I don't know how dependency relations are currently queried in KorAP.

Akron commented 1 month ago

Existing root relations coming from conllu2korapxml point to the sentence span. I am fine with that.

bansp commented 1 month ago

That seems to be a consistent way of handling this in a span-oriented setting. Why not go for that and see what happens?

luengen commented 1 month ago

Yes, it seems to make sense.