KorAP / Koral

Translation of query languages to serialized KoralQuery protocol
BSD 2-Clause "Simplified" License

ANNIS QL: Serialization of dominance #31

Closed margaretha closed 7 years ago

margaretha commented 7 years ago

Dominance is serialized as a relation with the layer c and without a key, which is not a valid koral:term object.

Suggestion: the queries

node & node & #2 > #1
node & node & #2 ->dominance #1

could be serialized identically, as a relation with the key dominance:

"operation": "operation:relation",
"relation": {
  "@type": "koral:relation",
  "wrap": {
    "@type": "koral:term",
    "key": "dominance"
  }
}

Akron commented 7 years ago

As a non-typed relation of constituency, this will be deserialized as a positional query, so I am not sure it should be serialized as a relation at all.

margaretha commented 7 years ago

Could you write a suggestion on how the positional query would be?

Akron commented 7 years ago

I think Joachim and I had the following serialization in mind:

{
  "@type" : "koral:group",
  "operation" : "operation:position",
  "frames" : ["frames:startsWith", "frames:endsWith", "frames:isAround", "frames:matches"],
  "boundary" : {
    "@type" : "koral:boundary",
    "min" : 0,
    "max" : 3
  },
  "operands" : [X,Y]
}

Although "boundary" may have a different name. It describes the minimum and maximum depth between X and Y.
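
To make the proposal concrete, here is a minimal Python sketch that builds such a group and serializes it with the standard json module. The helper name position_with_depth and its parameters are made up for illustration and are not part of Koral or Krill; the frame list and the "boundary" key are copied from the proposal above, and empty spans stand in for X and Y.

```python
import json

# Frames copied from the proposal above.
FRAMES = [
    "frames:startsWith",
    "frames:endsWith",
    "frames:isAround",
    "frames:matches",
]


def position_with_depth(x, y, min_depth, max_depth):
    """Hypothetical helper: position group whose boundary holds the depth range."""
    if min_depth < 0 or max_depth < min_depth:
        raise ValueError("expected 0 <= min_depth <= max_depth")
    return {
        "@type": "koral:group",
        "operation": "operation:position",
        "frames": list(FRAMES),
        "boundary": {
            "@type": "koral:boundary",
            "min": min_depth,  # numbers, not strings
            "max": max_depth,
        },
        "operands": [x, y],
    }


# X and Y can be any operand objects; empty spans are placeholders here.
group = position_with_depth({"@type": "koral:span"}, {"@type": "koral:span"}, 0, 3)
print(json.dumps(group, indent=2))
```

Building the object as a dictionary and letting json.dumps write it out avoids hand-written punctuation slips and keeps min and max numeric.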

margaretha commented 7 years ago

I like the depth definition, but operation:sequence does not seem quite right.

Dominance looks more like a specific relation since its behaviour is very similar to a labelled pointing relation with an implicit label dominance. But it is more on the constituency/hierarchy/vertical level, while relation is not bound to any level (typically used for anaphoric/coreference relations).

Edit: operation:relation without relType should be fine. Besides, we should also handle attribute/edge type of the dominance. See #30

Akron commented 7 years ago

I already fixed the operation directly after editing. ;)

Relations need specific annotations, so I don't think it's useful here.

margaretha commented 7 years ago

Oh I have to refresh the page manually. But why frames?

Edit: Do you mean koral:relation? But relType is optional, so we don't need koral:relation here. We could also make wrap in koral:relation optional and add an optional attr.

Akron commented 7 years ago

Currently ["frames:contains"] is the default for frames in operation:position, but as this is no real frame, but a shortcut for frames, I think we will remove it in favor of the 13-frame system that Joachim already used.

With relations I meant relations in Krill. How would you serialize dominance without a relType using operation:relation?

margaretha commented 7 years ago

Currently ["frames:contains"] is the default for frames in operation:position, but as this is no real frame, but a shortcut for frames, ...

I mean why do we need to define the frames to solve dominance?

I suppose this should be enough

{
  "@type" : "koral:group",
  "operation" : "operation:relation",
  "depth" : {
    "@type" : "koral:boundary",
    "min" : 0,
    "max" : 3
  },
  "operands" : [X,Y]
}

No, in Krill it should not be handled using RelationSpans and there shouldn't be a problem in adjusting the deserialization, right?

Akron commented 7 years ago

Hm - I think operation:position here would be more meaningful. otherwise it's an unlabeled relation with a "depth" called boundary ... This could also be a relation in a dependency tree.

margaretha commented 7 years ago

otherwise we have to specify koral:relation as my first suggestion. The relation is indeed implicit.

For operation:position, it should have

  1. an indication of the vertical depth (maybe add a new frame type?)
  2. attribute

Akron commented 7 years ago

I think, as you proposed, having a koral:boundary attribute with the name depth should be enough, don't you think?

margaretha commented 7 years ago

I mean an attribute/edge-type like >[func="sbj"]. Maybe it depends on the corpus data but it exists in AQL.

Akron commented 7 years ago

If a label exists, we can use operation:relation of course, but an undefined relation seems to me a bit weird. In case it's a hierarchical relation, I prefer operation:position.

margaretha commented 7 years ago

The operator > is always dominance, other relations are defined with the operator -> so >[func="sbj"] is also a hierarchical relation. I am fine with operation:position but I think > and >[func="sbj"] should not be serialized into two different koral operations when they are meant to be identical operations in AQL, of course the latter with an addition of a type/attribute.

Akron commented 7 years ago

Ah - I see. Okay - yes, that makes sense. You are right - we shouldn't have different serializations for similar constructs.

margaretha commented 7 years ago

So both of them with operation:position or operation:relation ? Either way, we need an additional attr like in relType for operation:relation. I am not sure where we can add it for operation:position.

I think the many frames definition

"frames" : ["frames:startsWith", "frames:endsWith", "frames:isAround", "frames:matches"],

for hierarchical position does not really make sense. Firstly, it is not in the Annis query itself. Secondly, the frames are supposed to define the position of the two operands, which is vertical/hierarchical, not startsWith etc. Thirdly, they are not really used for solving the query, are they?

Akron commented 7 years ago

operation:relation, although I dislike that "hierarchical dominance" is somehow implicit then.

And: Yes, they are used for solving the query.

margaretha commented 7 years ago

How about defining a new operation? Using operation:relation is actually our own interpretation since the "dominance" syntax is similar to a relation.

Akron commented 7 years ago

So you would vote for operation:dominance or operation:hierarchy?

margaretha commented 7 years ago

sry for my late reply. I have just found your answer and didn't receive emails regarding these comments. hmm, operation:hierarchy sounds more general than operation:dominance. Dominance very much refers to Annis.

Akron commented 7 years ago

That means you would prefer hierarchy? We then may be able to reuse this for, e.g. CSS selector queries, in case we want to support that.

margaretha commented 7 years ago

yes, it's good if we can reuse it!

margaretha commented 7 years ago

Dominance queries will now be serialized with operation:hierarchy. For example, the query node > cnx/c="np" is serialized as follows:

{
    "@context": "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld",
    "query": {
        "operation": "operation:hierarchy",
        "operands": [
            {"@type": "koral:span"},
            {
                "@type": "koral:span",
                "layer": "c",
                "foundry": "cnx",
                "match": "match:eq",
                "key": "np"
            }
        ],
        "@type": "koral:group"
    }
}
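
The same structure can be produced programmatically. The following Python sketch is an illustration only: the function name hierarchy_query is hypothetical and not Koral's actual API, and it assumes that the operand order follows the order of the nodes in the query.

```python
import json

CONTEXT = "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld"


def hierarchy_query(first, second):
    """Hypothetical helper: wrap two operands in an operation:hierarchy group."""
    return {
        "@context": CONTEXT,
        "query": {
            "@type": "koral:group",
            "operation": "operation:hierarchy",
            "operands": [first, second],
        },
    }


# node > cnx/c="np": an unrestricted node and an NP constituent span.
any_node = {"@type": "koral:span"}
np_span = {
    "@type": "koral:span",
    "foundry": "cnx",
    "layer": "c",
    "match": "match:eq",
    "key": "np",
}

print(json.dumps(hierarchy_query(any_node, np_span), indent=4))
```

The key order in the printed JSON differs from the example above, which does not matter for JSON objects.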

Dominance query with boundary: node & node & #1 >2,4 #2

{
    "@context": "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld",
    "query": {
        "operation": "operation:hierarchy",
        "operands": [
            {"@type": "koral:span"},
            {"@type": "koral:span"}
        ],
        "@type": "koral:group",
        "boundary": {
            "min": 2,
            "max": 4,
            "@type": "koral:boundary"
        }
    }
}
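
How the distance part of the operator maps to a koral:boundary can be sketched in a few lines of Python. This is not Koral's parser: parse_depth is a hypothetical name, the regular expression covers only the plain forms >, >n and >n,m (no >* and no edge labels), and treating >n as min = max = n is an assumption.

```python
import re

# Assumed subset of the AQL dominance operator: ">", ">n" and ">n,m".
_DOMINANCE = re.compile(r">(?:(\d+)(?:,(\d+))?)?")


def parse_depth(operator):
    """Return a koral:boundary dict for ">n" / ">n,m", or None for a bare ">"."""
    match = _DOMINANCE.fullmatch(operator.strip())
    if match is None:
        raise ValueError(f"unsupported dominance operator: {operator!r}")
    low, high = match.groups()
    if low is None:
        return None  # direct dominance, no explicit boundary
    min_depth = int(low)
    # Assumption: a single number means an exact distance.
    max_depth = int(high) if high is not None else min_depth
    if max_depth < min_depth:
        raise ValueError("max depth must not be smaller than min depth")
    return {"@type": "koral:boundary", "min": min_depth, "max": max_depth}


print(parse_depth(">2,4"))
print(parse_depth(">"))
```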

Dominance query with label: "Mann" & node & #2 >[func="SBJ"] #1

I am not quite sure about the serialization of this query. In Annis, func="SBJ" is the same as const:func="SBJ". I think this means that func is a constituent layer, so I would serialize it as follows:

{
    "@context": "http://korap.ids-mannheim.de/ns/koral/0.3/context.jsonld",
    "query": {
        "operation": "operation:hierarchy",
        "operands": [
            {"@type": "koral:span"},
            {
                "wrap": {
                    "@type": "koral:term",
                    "layer": "orth",
                    "match": "match:eq",
                    "key": "Mann"
                },
                "@type": "koral:token"
            }
        ],
        "@type": "koral:group",
        "label": {
            "@type": "koral:term",
            "layer": "c",
            "match": "match:eq",
            "key": "SBJ"
        }
    }
}

Joachim noted that a "c"-layer term (constituency relation/dominance) is needed, so the label would be a termGroup:

"label": {
    "operands": [
        {
            "@type": "koral:term",
            "layer": "func",
            "match": "match:eq",
            "key": "SB"
        },
        {
            "@type": "koral:term",
            "layer": "c"
        }
    ],
    "@type": "koral:termGroup",
    "relation": "relation:and"
}
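
As a rough illustration of how such a label could be assembled, the following Python sketch combines the annotation term with the bare "c"-layer term under relation:and. The function name edge_label is hypothetical, and the layer and key values ("func", "SB") are simply taken from the fragment above; whether "func" should be a layer at all is still open.

```python
import json


def edge_label(layer, key):
    """Hypothetical helper: AND the edge annotation with a bare "c"-layer term."""
    return {
        "@type": "koral:termGroup",
        "relation": "relation:and",
        "operands": [
            {
                "@type": "koral:term",
                "layer": layer,
                "match": "match:eq",
                "key": key,
            },
            # Marks the edge as a constituency (dominance) edge.
            {"@type": "koral:term", "layer": "c"},
        ],
    }


print(json.dumps({"label": edge_label("func", "SB")}, indent=4))
```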

What do you think?

Akron commented 7 years ago

I think the first serializations look fine - and when ignoring the empty spans, they are deserializable in Krill. Some comments:

  1. Can dominance span multiple foundries/layers, or should the undefined nodes have the same foundry and layer as the other operands? In the third example there is a dominance relation with a surface term - what does that mean?
  2. I think "relType" and "label" are quite similar, so I guess we should merge their usage.
  3. Regarding the last serialization: Is "func" really a layer? That looks weird ...

margaretha commented 7 years ago

Can dominance span multiple foundries/layers, or should the undefined nodes have the same foundry and layer as the other operands?

I am not sure, but Annis doc section 4.6 suggests it can be "any" node or annotation, so I think multiple foundries/layers are ok.

In the third example there is a dominance relation with a surface term - what does that mean?

Why not? A node can be anything such as text, can't it?

  2. I think "relType" and "label" are quite similar, so I guess we should merge their usage.

Ok good idea!

  3. Regarding the last serialization: Is "func" really a layer? That looks weird ...

well, according to the ANNIS doc, "func" is an annotation and I believe it is supposed to be a functional dependency annotation. In the example const:func, const is a namespace and means constituent, but I don't think it has to be constituent. So the namespace allows for multiple layers. I don't really get why "c"-layer term is needed as Joachim suggested.

Btw, namespaces are not handled yet. Should we support this?

Akron commented 7 years ago

Hm - as they only show examples with single trees, I would say no. I don't even know how this could be handled.

Ah, right. But the addition of foundry and layer in the case of arbitrary nodes, should be done in Kustvakt rewriting, right?

I don't say it's not possible, but I guess the term needs to be part of the tree (in that case a leaf node). So it needs to be indexed as part of the tree structure to have a depth (and be a span) - meaning it needs the same foundry/layer. Or am I missing something? Otherwise this would mean, there is a meaningful alignment of different hierarchies in the document's annotation and I don't know how this is possible. Though: For leaf nodes I accept that a translation to the relevant foundry/layer can be implicit.

That it needs the same foundry/layer makes sense. But should this be restricted in Koral as a matter of syntax?

I would say, const is the layer, func is the key, SB is the value, though the translation would not match our index at all in Krill.

Oh that's cool. Hm that depends on the annotation data. Do we have any hierarchy annotations in our data at all? How about we support this format: [foundry/layer:key=value] ?

margaretha commented 7 years ago

Btw, there is another parameter. A dominance edge may be restricted to a certain type, such as "secondary edge" when a node has more than one child node, e.g. >secedge[foundry/layer:key=value]. In this case, it seems that the ranking of the edges (primary, secondary) is explicitly annotated. Would such a type exist in our annotations?

Can dominance span multiple foundries/layers, or should the undefined nodes have the same foundry and layer as the other operands?

There was at least an example with 2 different layers: constituent and POS. cat="NP" > pos="RB" where RB is an adverb.

ps: sry, I've just realized that I answered directly in your last comment.

bansp commented 7 years ago

Hi Eliza, a quick note: secondary edges are not for additional children. They are a legacy concept from the early Tiger XML datamodel, where primary edges indicated either constituency or dependency in the relational-grammar sense, while secondary edges expressed non-local relations, such as co-reference (and probably also "movement" in some grammatical models). In later models, there was a unified concept of edge with different labels (and maybe with separately indicated different functions, but I can't recall that clearly now).

bansp commented 7 years ago

nd> Do we have any hierarchy annotations in our data at all?

I think that at least XIP annotations indicated both hierarchy and dependency (dependencies were sometimes defined not over terminals but over a terminal vs. a phrase).

margaretha commented 7 years ago

Hi Piotr, thanks for the clarification about secondary edges! I cannot really imagine what secondary edges would be in the hierarchical sense though.

Annis has separate operators for dominance (hierarchy) and relations (e.g dependency & co-reference).

bansp commented 7 years ago

em> I cannot really imagine what secondary edges would be in the hierarchical sense though.

They weren't used for hierarchies. They were "secondary" exactly because they violated the basic ("primary"?) Tiger tree model based on dominance. So, for example, a secondary edge could link "him" to "her boyfriend" in a sentence "Her boyfriend is overall a nice guy, but I still don't like him", where there is definitely no hierarchical relationship between the two nodes in question.

margaretha commented 7 years ago

They weren't used for hierarchies. They were "secondary" exactly because they violated the basic ("primary"?) Tiger tree model based on dominance.

then it is strange to have such a type in Annis dominance while it also has pointing relation operator that is more appropriate for the secondary edges. Do you think it is specifically added for supporting data using this early Tiger XML datamodel ?

bansp commented 7 years ago

Oh gosh, I have completely no idea and can't investigate this at the moment (still not done with my contribution to tomorrow's evaluation ;-)). But I know a perfect person to ask, if he has the time and can be bothered: @amir-zeldes (whom I can't apparently reference from here because of some formal reasons :-/ )

amir-zeldes commented 7 years ago

Hi @bansp, the reference seems to work fine! I got this message anyway. I'll try to answer below:

Secedges in the Tiger corpus expressed forms of structure sharing that violated the unique parent assumption, for example right node raising, gapping, etc. As such, they are considered to be proper dominance relations (they imply inherited coverage, unlike pointing relations). You can see an ANNIS example from the Potsdam Commentary Corpus here (open the constituents view): cat="S" >secedge tok="was"

However, the device of edge typing is more general than that in ANNIS, which is based on Salt. In practice it usually follows the modeling strategies used by PAULA XML so maybe looking at those is a better way to understand things (see the PAULA documentation). In a nutshell, ANNIS edges have:

I hope that answers the question and gives an idea of the data model - if anything is unclear just let me know.

margaretha commented 7 years ago

Hi @amir-zeldes, thank you for your explanation! I am not sure how multiple annotations are separated or if they have boolean operations in an edge label since I cannot find it in the Annis 3 documentation. Could you please give an example?

amir-zeldes commented 7 years ago

You're right, AQL doesn't provide any special syntax for boolean operations on edge annotations, and TBH I think we also don't really have an example corpus containing multiple edge annotations. That said, you can get that kind of behavior with a more complex query. If you want boolean AND, you can simply repeat the relation declaration. There is no reflexivity constraint on the edge connection, so this will match AND:

x & y & #1 ->rel[anno1="val"] #2 & #1 ->rel[anno2="val"] #2

It's not exactly elegant, but it should work. For OR you can use the general disjunction using |. Either full query, or in more recent ANNIS versions also relation disjunction:

Relation disjunction:

tok & tok & (#1 ->dep[func="nsubj"] #2 | #1 ->dep[func="nsubjpass"] #2)

Full query disjunction:

tok & tok & #1 ->dep[func="nsubj"] #2 | tok & tok & #1 ->dep[func="nsubjpass"] #2

Example in GUM: https://corpling.uis.georgetown.edu/annis/#_q=dG9rICYgdG9rICYgKCMxIC0-ZGVwW2Z1bmM9Im5zdWJqIl0gIzIgfCAjMSAtPmRlcFtmdW5jPSJuc3VianBhc3MiXSAjMik&_c=R1VN&cl=5&cr=5&s=0&l=10
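
The two patterns above are mechanical enough to generate. The following Python sketch only assembles AQL strings in the shapes shown in this comment; the function names edge_and and edge_or are hypothetical, and nothing here talks to ANNIS or validates the query.

```python
def _edge(rel, name, value, src, tgt):
    """One pointing-relation clause with a single edge annotation."""
    return f'#{src} ->{rel}[{name}="{value}"] #{tgt}'


def edge_and(rel, annotations, src=1, tgt=2):
    """AND: repeat the relation declaration once per annotation."""
    return " & ".join(_edge(rel, n, v, src, tgt) for n, v in annotations)


def edge_or(rel, annotations, src=1, tgt=2):
    """OR: relation disjunction with |, wrapped in parentheses."""
    return "(" + " | ".join(_edge(rel, n, v, src, tgt) for n, v in annotations) + ")"


print("x & y & " + edge_and("rel", [("anno1", "val"), ("anno2", "val")]))
print("tok & tok & " + edge_or("dep", [("func", "nsubj"), ("func", "nsubjpass")]))
```

The first call reproduces the AND query and the second the relation disjunction quoted above.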

Akron commented 7 years ago

nd> Do we have any hierarchy annotations in our data at all?

@bansp: Yes, we have constituency annotations from CoreNLP and dependency annotations from MALT.