dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Adding a "flavor" feature to the Dependency type #879

Closed reckart closed 6 years ago

reckart commented 8 years ago

Normally, dependency relations form a tree. However, some formats, e.g. CoNLL-U, also make provisions for dependencies that do not form a tree. Also, CoreNLP drops the guarantee that dependencies form a tree if dependencies are collapsed (they state it can include cycles and re-entrancies).

I am thinking about introducing a new feature on the Dependency type called "secondary" which would be set to "true" if the dependency relation is not part of a tree of primary dependency relations. For CoNLL-U, such relations would be stored in a separate column. For other CoNLL formats and formats supporting only tree-structured dependencies, such secondary dependencies would be entirely omitted.

Any opinions?
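
A minimal sketch of the proposal above, assuming a hypothetical `Dep` record with a boolean `secondary` field (none of these names are DKPro Core API). Primary relations fill HEAD and DEPREL, secondary relations go to the separate DEPS column for CoNLL-U, and a tree-only writer drops them:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dep:
    """Hypothetical stand-in for a Dependency annotation (not DKPro Core API)."""
    governor: int      # 1-based token index of the head, 0 for the root
    dependent: int     # 1-based token index of the dependent
    label: str
    secondary: bool = False


def conllu_columns(token_id, deps):
    """Return (HEAD, DEPREL, DEPS) strings for one token."""
    primary = [d for d in deps if d.dependent == token_id and not d.secondary]
    extra = [d for d in deps if d.dependent == token_id and d.secondary]
    if primary:
        head, deprel = str(primary[0].governor), primary[0].label
    else:
        head, deprel = "_", "_"
    extra = sorted(extra, key=lambda d: d.governor)
    deps_col = "|".join(f"{d.governor}:{d.label}" for d in extra) or "_"
    return head, deprel, deps_col


def tree_only_columns(token_id, deps):
    """Tree-only formats: secondary relations are omitted entirely."""
    head, deprel, _ = conllu_columns(token_id, deps)
    return head, deprel


deps = [
    Dep(0, 2, "root"),
    Dep(2, 1, "nsubj"),
    Dep(2, 3, "xcomp"),
    Dep(3, 1, "nsubj", secondary=True),  # re-entrancy: token 1 gets a second head
]
```

Here `conllu_columns(1, deps)` yields `("2", "nsubj", "3:nsubj")`, while `tree_only_columns(1, deps)` keeps only the primary head.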

jnivre commented 8 years ago

This sounds like a simple yet adequate solution to the problem of distinguishing between primary and secondary dependencies for frameworks that use this distinction. Let me just check that I have understood correctly. For CoNLL-X, all dependencies would have this feature set to false, because they must form a tree (and no secondary dependencies are allowed). For CoNLL-U, primary dependencies would have false and secondary true. What about frameworks that only recognise one set of dependencies but where these are not required to form a tree (like old-style Stanford collapsed dependencies or one of the new data sets with so-called semantic dependencies)? Would then all dependencies have the feature set to true?

reckart commented 8 years ago

@jnivre that's a good question. One option might be to retain all uncollapsed dependencies (with secondary = false) and then additionally add the collapsed dependencies on top (with secondary = true). At least in that way, no information would be lost.

reckart commented 8 years ago

@jnivre I guess for a data set where the dependencies are not expected to be a tree at all, all the dependencies would have secondary set to true. Maybe "secondary" is not a good name then...

Do you know how information from such a new-style framework would be rendered in CoNLL-U? Would the HEAD and DEPREL columns be empty then and all dependencies go to the DEPS column?

dan-zeman commented 8 years ago

Unless the CoNLL-U format specification is modified you are always required to provide a rooted tree using HEAD and DEPREL. I guess if it does not make sense for what you're doing then all HEADs are 0, all DEPRELs are root, and the validator must be run with the option that single root is not required.
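
The fallback described above could look roughly like this. It is only a sketch: `graph_only_row` and its dictionary layout are made up for illustration, and the validator option mentioned in the comment is not shown.

```python
def graph_only_row(token_id, incoming):
    """Build the dependency columns for a token of a graph-only data set.

    incoming: list of (head, label) pairs pointing at this token.
    Every token is attached to the artificial root in HEAD/DEPREL,
    and the real graph edges are listed in DEPS.
    """
    deps_col = "|".join(f"{head}:{label}" for head, label in sorted(incoming)) or "_"
    return {"ID": token_id, "HEAD": 0, "DEPREL": "root", "DEPS": deps_col}
```

With every token attached to the root, the HEAD/DEPREL columns carry no information of their own; the graph lives entirely in DEPS.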

jnivre commented 8 years ago

@reckart I don't think these frameworks would use CoNLL-U at all, so perhaps it is a less relevant question right now. They are more likely to use a version of the CoNLL-2009 format, as discussed here: http://alt.qcri.org/semeval2014/task8/index.php?id=data-and-tools

reckart commented 8 years ago

@judithek @zesch Any comments from your sides?

reckart commented 8 years ago

Btw. LAPPS uses a "dependencyType" feature which indicates values like "basic", "collapsed", etc. That may be a more flexible alternative to a simple boolean flag. With respect to the CoNLL formats, it could be interpreted similarly. I.e. "basic" dependencies would be expected to have a tree structure (or at least a single head) and dependencies with other types would be put into the DEPS column in CoNLL-U and be omitted in other CoNLL formats.

zesch commented 8 years ago

I haven't followed the discussion in all details and would be fine with whatever you decide here.

judithek commented 8 years ago

A "dependencyType" feature which indicates values like "basic", "collapsed" would be more flexible - I would prefer it over a binary flag. If I understand it correctly, it could also nicely be used to represent a CoNLL-2009 format-based semantic dependency graph (e.g. using a value "semantic").

reckart commented 8 years ago

@judithek We currently read CoNLL-2009 semantic columns into SemPred and SemArg annotations - is that wrong?

judithek commented 8 years ago

no, it should be right if you have an application that needs SemPred and SemArg annotations (i.e. semantic role labeling). As I understand it, the semantic dependency graphs are more general than semantic role labeling annotations: see http://alt.qcri.org/semeval2014/task8/index.php?id=dependency-formats

The semantic dependency graph is represented using an adapted version of the CoNLL-2009 format, where only a subset of the CoNLL-2009 inventory is used plus additional columns for top predicate and arguments, as described here: http://alt.qcri.org/semeval2014/task8/index.php?id=data-and-tools

reckart commented 8 years ago

Well, that would be the SemEval 2014 format then, I guess, and we would add a separate reader for that if desired.

jnivre commented 8 years ago

The idea of having a multi-valued (rather than binary) feature seems appealing. As far as UD is concerned, only the values "basic" and "enhanced" would be needed (for now). The notion of a "collapsed" dependency is no longer relevant in the UD framework.

reckart commented 8 years ago

Since we already use a feature named "DependencyType" to indicate the relation label, I'm going to introduce a new feature called "flavor".

jnivre commented 8 years ago

Maybe "vanilla" and "chocolate" is better than "basic" and "enhanced" then. :)

reckart commented 7 years ago

I am starting to wonder whether it was a good idea to model this as a "flavor" feature. Within DKPro Core, that works reasonably ok. But when considering transferring this into WebAnno, it seems to me that introducing an EnhancedDependency type might be the better choice. I think it would make it easier to model the different behaviours of the two types, i.e.:

The drawback of course would be that a select(jcas, Dependency) call would no longer return all dependencies, but only the basic ones.

Of course, some way of looking sharply at a dependency structure and telling which relations are basic and which are not might also lead to a nice solution. Then the decision whether an edge is basic or enhanced could be made automatically and could be deferred until the data is actually serialized into CoNLL-U... but I fear it might not be possible to make that distinction in a generic way.

oepen commented 7 years ago

i am not quite sure how you would define ‘basic’ in this context?

in my view, single-rooted, fully connected trees are on the way out. variants of the stanford dependencies have long given up both the single-head and no-isolated-nodes properties. does DKPro Core (try to) tease apart the various types of edges delivered by CoreNLP when requesting these dependency variants?

i consider it an anachronism in CoNLL-U to make the distinction between the ‘primary’ and ‘enhanced’ dependency structures. i believe that general graphs are the future, and i see no (straightforward) principled (linguistic) way to single out one of the incoming edges on nodes that exhibit reentrancies, maybe particularly so in more semantically oriented dependency representation, e.g.

http://www.lrec-conf.org/proceedings/lrec2016/pdf/887_Paper.pdf

jnivre commented 7 years ago

The background here (I guess) is how to configure WebAnno for UD annotation. UD still requires a set of basic dependencies that form a spanning tree over the words of the sentence and is likely to do so for the foreseeable future. However, a difference in v2 (released December 1) is that enhanced dependencies are not always a superset of the basic dependencies. So what is needed for v2 is a mechanism for specifying whether a dependency belongs to basic, enhanced or both and (as before) for checking that the set of basic dependencies forms a tree. Note also that enhanced dependencies in v2 may involve empty nodes, while basic dependencies may not.

Joakim
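
The tree check mentioned above can be sketched as follows, assuming edges are given as plain `(head, dependent)` pairs with `0` as the artificial root. The function name and input layout are illustrative only, and the sketch does not enforce a single root:

```python
def is_spanning_tree(n_tokens, basic_edges):
    """Check that basic edges give every token exactly one head and no cycles.

    basic_edges: iterable of (head, dependent); tokens are numbered 1..n_tokens
    and head 0 stands for the artificial root.
    """
    head_of = {}
    for head, dep in basic_edges:
        in_range = 1 <= dep <= n_tokens and 0 <= head <= n_tokens
        if not in_range or dep in head_of:
            return False  # unknown token, or a second head for the same token
        head_of[dep] = head
    if len(head_of) != n_tokens:
        return False  # some token has no head at all
    for start in range(1, n_tokens + 1):
        seen = set()
        node = start
        while node != 0:  # walk up towards the root
            if node in seen:
                return False  # cycle: the root is never reached
            seen.add(node)
            node = head_of[node]
    return True
```

Enhanced edges would simply not be passed to this check, since they are allowed to introduce re-entrancies.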


oepen commented 7 years ago

my apologies: being a near-perfect github novice, i had not realized i could simply scroll up to see more context.

—for a dependency in UD v2 that is considered both basic and enhanced, does the edge occur twice in CoNLL-U, i.e. as a basic dependency in HEAD and DEPREL, and as an enhanced one in DEPS? if not, how would CoNLL-U represent this type of dependencies?

from the discussion so far, i had not gleaned a strong argument for anything but a binary distinction (i.e. basic-ness). but this new feature of UD v2 would then seem to either call for a three-valued distinction, or for actually duplicating the edge (as i suspect has to be done in CoNLL-U). assuming that edges that are both basic and enhanced, nevertheless, should be considered one entity, the latter solution would appear inferior to me.

jnivre commented 7 years ago

It occurs twice. More info: http://universaldependencies.org/format.html http://universaldependencies.org/u/overview/enhanced-syntax.html
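
To illustrate the duplication, here is a small sketch that reads one CoNLL-U token line and reports which incoming edges are basic, enhanced, or both. It relies only on the standard ten-column layout (HEAD, DEPREL and DEPS being columns 7 to 9); the function name and the returned dictionary are made up for this example:

```python
def edge_membership(conllu_line):
    """Classify the incoming edges of one token by where they appear."""
    cols = conllu_line.rstrip("\n").split("\t")
    head, deprel, deps = cols[6], cols[7], cols[8]
    basic = {(head, deprel)} if head != "_" else set()
    enhanced = set()
    if deps != "_":
        for item in deps.split("|"):
            # Split on the first colon only, so labels like nmod:poss survive.
            h, _, rel = item.partition(":")
            enhanced.add((h, rel))
    return {
        "both": basic & enhanced,
        "basic_only": basic - enhanced,
        "enhanced_only": enhanced - basic,
    }


line = "1\tThey\tthey\tPRON\tPRP\t_\t2\tnsubj\t2:nsubj|4:nsubj\t_"
result = edge_membership(line)
```

For this line the edge `2:nsubj` appears in both places, while `4:nsubj` is enhanced only.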

reckart commented 7 years ago

Several readers and components set the flavor to ENHANCED based on whether a node has multiple governors or, e.g. in the case of the Stanford components, whether the "extra" flag is set on the dependency. I think that is the best we can do atm.
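
A rough sketch of the multiple-governor heuristic, under the assumption (made for this example, not taken from the DKPro Core code) that the first edge seen for a dependent counts as basic and every further governor is marked enhanced. The Stanford "extra" flag case is not covered:

```python
BASIC, ENHANCED = "basic", "enhanced"


def assign_flavors(edges):
    """edges: list of (governor, dependent, label) in reading order.

    Returns the same edges with a flavor appended. Assumption: the first
    edge for a dependent is basic, any additional governor is enhanced.
    """
    seen_dependents = set()
    flavored = []
    for gov, dep, label in edges:
        flavor = ENHANCED if dep in seen_dependents else BASIC
        seen_dependents.add(dep)
        flavored.append((gov, dep, label, flavor))
    return flavored
```

The result depends on edge order, which is one reason such a heuristic is only a best effort.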

reckart commented 7 years ago

It seems the definition of enhanced dependencies in UD 2 has shifted in a direction where modelling enhanced dependencies as a separate layer, instead of having them as a "flavor" of dependencies in general, makes more sense: http://stp.lingfil.uu.se/pipermail/ud/2017-November/000488.html

reckart commented 7 years ago

I am considering undoing the addition of the "flavor" feature and instead introducing a new type "EnhancedDependency".

jnivre commented 7 years ago

Sounds good to me. Sorry for making your life hard by changing the standard. :)

oepen commented 7 years ago

colleagues,

a separate type ‘EnhancedDependency’ sounds a bit like a UD-specific patch to me. would it be worthwhile to see whether a generalization to multi-layer dependency structures could cover the UD representations and also scale to other use cases? personally, i could be interested in encoding the prague a- and t-layers jointly (where one is a connected tree, the other a tree where some surface tokens are unattached, and there can be empty t-layer nodes); as well as in representing various flavors of surface dependencies together with ‘my’ semantic dependencies (see http://sdp.delph-in.net). ideally, the design would not be limited to just two layers. with a little bit of imagination, i could see a stack of parsers computing semantic dependencies on top of both of the current UD layers ...

cheers, oe
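
One way to picture such a multi-layer design is to let every edge carry a free-form layer name, so the number of layers is not fixed in advance. This is a sketch only; the layer names and the tuple layout are hypothetical and not part of any existing type system:

```python
from collections import defaultdict


def by_layer(edges):
    """Group edges by their layer name.

    edges: iterable of (layer, governor, dependent, label).
    """
    layers = defaultdict(list)
    for layer, gov, dep, label in edges:
        layers[layer].append((gov, dep, label))
    return dict(layers)


edges = [
    ("ud-basic", 2, 1, "nsubj"),
    ("ud-enhanced", 2, 1, "nsubj"),
    ("ud-enhanced", 4, 1, "nsubj"),
    ("sdp", 2, 1, "ARG1"),
]
layers = by_layer(edges)
```

Per-layer constraints (tree, graph, unattached tokens allowed) would then be checked layer by layer rather than for the edge set as a whole.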

reckart commented 7 years ago

Thanks for the feedback!

I don't want to rush discarding the flavor property in favor of another solution. But since DKPro Core 1.9.0 wants to be released rather sooner than later and since flavor was added for 1.9.0, it would be easier to replace it now than later (i.e. after it has been included in a release). So it is a good time to consider whether this should stay or be revised.

(sorry for meandering thoughts below...)

The "flavor" feature that was introduced as part of this issue basically provides a subcategorization mechanism that operates outside of the type system inheritance hierarchy. So instead of refactoring the type system to a type/subtype construction such as

Annotation
   Dependency
   EnhancedDependency

or even

Annotation
   Dependency
      BasicDependency
      EnhancedDependency

it seemed more straightforward to add a feature to do the subcategorization:

Annotation
   Dependency (flavor = basic | enhanced)

That solution was a minimal change to the type system and also allows supporting additional flavors without changing the type system. In particular, it required no changes to the current subtype structure of Dependency which are "elevated types" such as de.tudarmstadt.ukp.dkpro.core.api.syntax.type.dependency.ACOMP etc. Also using the flavor feature instead of subtyping follows a similar approach in the LAPPS Vocabulary where the respective feature is called dependencyType.
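
The practical difference between the designs shows up when selecting annotations by type. The following toy model is not UIMA or uimaFIT code; `select` here is a made-up analogue that matches a type and its subtypes. It contrasts the sibling-type design with the flavor feature:

```python
class Annotation:
    pass


class Dependency(Annotation):
    def __init__(self, label, flavor="basic"):
        self.label = label
        self.flavor = flavor


class EnhancedDependency(Annotation):
    """Sibling-type design: deliberately not a subtype of Dependency."""

    def __init__(self, label):
        self.label = label


def select(annotations, type_):
    """Toy analogue of a type-based select: matches the type and its subtypes."""
    return [a for a in annotations if isinstance(a, type_)]


# Design 1: enhanced relations live in a separate sibling type.
sibling_design = [Dependency("nsubj"), EnhancedDependency("nsubj")]

# Design 3: one type, distinguished by the flavor feature.
flavor_design = [Dependency("nsubj"), Dependency("nsubj", flavor="enhanced")]
```

With the sibling types, `select(sibling_design, Dependency)` returns only the basic relation, whereas with the flavor feature a single select returns both and callers filter on `flavor` when they need only one kind.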

Readers and writers have been changed to specially handle the value of the flavor feature. However, from the perspective of the type system, basic and enhanced (and any other kind of flavor that might be introduced) are still one annotation layer.

The DKPro Core type system is also used by WebAnno, so in addition to the perspective of automatic processing, we occasionally get input for the type system design from the perspective of manual annotation. From that perspective, it seems that considering basic and enhanced dependencies as separate layers would make sense. If they are modelled as separate layers, one could

Supporting this with the current flavor feature would require implementing a more fine-grained control over the coloring strategy in WebAnno as well as adding some capability to filter annotations from the view based on feature values (instead of just based on the layer they are on). These may in fact be very sensible extensions.

@oepen Does the flavor feature adequately support the multi-layer dependency structures you envision or are you imagining yet a more general type system design?

reckart commented 6 years ago

Ok, then let's leave it as it is for the time being so that it doesn't block the 1.9.0 release.