dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Token annotations cannot carry syntactic function #174

Closed reckart closed 7 years ago

reckart commented 9 years ago
The Stanford Parser produces syntactic functions for terminals (tokens), e.g. NN-DA
or NN-DO (noun + function). In the Constituent annotation, we can maintain the syntactic
function in the CAS, but the Token annotation has no such feature.

Probably the "Token" type should have a syntactic function feature as well?

Original issue reported on code.google.com by richard.eckart on 2013-06-30 17:03:59

reckart commented 9 years ago
Is there some documentation about these syntactic functions somewhere? I couldn't find
anything in a quick search.

Original issue reported on code.google.com by torsten.zesch on 2013-09-04 12:38:55

reckart commented 9 years ago
Yep, here: http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/kanten.html
The pos tags are here: http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/stts.asc

E.g. NN-DA = normal noun (NN) + dative (DA)
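
To make the label format concrete, here is a minimal sketch of splitting such a terminal label into its POS tag and syntactic function. The `TerminalLabel` class is hypothetical (it is not part of DKPro Core or the Stanford Parser), and it assumes the function is whatever follows the last hyphen:

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of DKPro Core or the Stanford Parser. */
final class TerminalLabel {
    final String posTag;
    final Optional<String> syntacticFunction;

    private TerminalLabel(String posTag, Optional<String> syntacticFunction) {
        this.posTag = posTag;
        this.syntacticFunction = syntacticFunction;
    }

    /** Splits a label such as "NN-DA" at its last hyphen (an assumption about the format). */
    static TerminalLabel parse(String label) {
        int dash = label.lastIndexOf('-');
        // No hyphen, or a hyphen at either edge (e.g. "-LRB-"): treat the whole label as the tag.
        if (dash <= 0 || dash == label.length() - 1) {
            return new TerminalLabel(label, Optional.empty());
        }
        return new TerminalLabel(label.substring(0, dash),
                Optional.of(label.substring(dash + 1)));
    }

    public static void main(String[] args) {
        TerminalLabel withFunction = parse("NN-DA");
        System.out.println(withFunction.posTag + " / "
                + withFunction.syntacticFunction.orElse("(none)"));
        TerminalLabel plain = parse("NN");
        System.out.println(plain.posTag + " / "
                + plain.syntacticFunction.orElse("(none)"));
    }
}
```

Tags that contain hyphens themselves would need tagset-specific handling; the edge check above only covers the simplest such cases.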

The connection between the token layer and the constituent layer is a bit awkward at
the moment. E.g. the "parent" feature in the token is of type Annotation because the
token is in the segmentation API while the type Constituent is in the syntax API (and
syntax depends on segmentation, so segmentation cannot depend on syntax). We might
want to consider if the token doesn't really belong to the syntax API, or if we can
find a way that the syntax API doesn't depend on the segmentation API. Btw. the dependency
is introduced because the Dependency type uses Token as its endpoints. So if we move
Dependency somewhere else... e.g. to "api.syntax.dependency".
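
A rough sketch of why the "parent" link is awkward, using plain Java stand-ins rather than the real UIMA/DKPro Core classes (the nested class names mirror the types discussed here; the `parentType` helper and everything else is assumed for illustration):

```java
/** Simplified stand-ins for the types discussed above; not the real DKPro Core classes. */
final class ParentLinkSketch {
    /** Stand-in for the generic annotation base type. */
    static class Annotation {
        final int begin;
        final int end;

        Annotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Would live in the segmentation API, so it cannot refer to Constituent. */
    static class Token extends Annotation {
        Annotation parent; // typed generically because syntax types are not visible here

        Token(int begin, int end) {
            super(begin, end);
        }
    }

    /** Would live in the syntax API, which depends on segmentation. */
    static class Constituent extends Annotation {
        final String constituentType;
        String syntacticFunction;

        Constituent(int begin, int end, String constituentType) {
            super(begin, end);
            this.constituentType = constituentType;
        }
    }

    /** Callers have to check and cast, because the feature is only typed as Annotation. */
    static String parentType(Token token) {
        if (token.parent instanceof Constituent) {
            return ((Constituent) token.parent).constituentType;
        }
        return null;
    }

    public static void main(String[] args) {
        Constituent np = new Constituent(0, 9, "NP");
        Token token = new Token(4, 9);
        token.parent = np;
        System.out.println(parentType(token));
    }
}
```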

Original issue reported on code.google.com by richard.eckart on 2013-09-04 13:01:49

reckart commented 9 years ago
I might be missing a point, but I don't see why a token should carry a syntactic function at all.
But I guess this is the old discussion about features instead of offset bound retrieval
:)
So if you think it makes sense to have it in token, I am fine with it.

Original issue reported on code.google.com by torsten.zesch on 2013-09-04 13:12:55

reckart commented 9 years ago
Here is my current opinion on the matter of offsets vs. features:

Offsets are a good starting point, in particular if it is not clear how often a navigation
path is used, if extensibility is an issue, and if one is not familiar yet with the
details of what is to be annotated.

Features are good if it is known that a navigation path is used often (and should be
reasonably fast) and if it is known that extensibility is not a problem, which entails
that there is a good familiarity with what is to be annotated.
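
The trade-off can be sketched with plain Java objects. This is only an illustration under assumed names (`Span`, `smallestCovering`), not how UIMA indexes actually work: offset-bound retrieval searches for a covering annotation each time, while a feature is a stored reference that is followed directly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Illustrative only: Span and smallestCovering are assumed names, not UIMA API. */
final class OffsetsVsFeatures {
    static class Span {
        final int begin;
        final int end;
        final String label;
        Span parent; // the "feature" variant: an explicit, precomputed link

        Span(int begin, int end, String label) {
            this.begin = begin;
            this.end = end;
            this.label = label;
        }
    }

    /** Offset-bound retrieval: scan all candidates for the smallest span covering the target. */
    static Optional<Span> smallestCovering(List<Span> candidates, Span target) {
        Span best = null;
        for (Span c : candidates) {
            boolean covers = c != target && c.begin <= target.begin && target.end <= c.end;
            if (covers && (best == null || (c.end - c.begin) < (best.end - best.begin))) {
                best = c;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        Span sentence = new Span(0, 20, "S");
        Span np = new Span(0, 9, "NP");
        Span token = new Span(4, 9, "Token");
        token.parent = np; // must be maintained by whoever creates the annotations

        // Offsets: no stored link needed, but every lookup is a search.
        System.out.println(smallestCovering(Arrays.asList(sentence, np, token), token)
                .map(s -> s.label).orElse("(none)"));
        // Feature: direct navigation, but the schema has to anticipate the link.
        System.out.println(token.parent.label);
    }
}
```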

In this issue, we have the case that we know there are syntactic function labels on
edges between constituents. There is a corresponding feature in the Constituent type
(although, admittedly, afaik we don't use it much). We treat Tokens as terminals in
the constituency structure, but in fact, in our type system, Token does not inherit
from Constituent and thus is not a Constituent. So we have a conceptual problem here:

- on the one hand, we treat Token as a terminal in the constituency structure, which
means that there is an edge between the Token and the constituent above. Such an edge
should allow for a syntactic function label.

- on the other hand, Token *is not* a Constituent. If it were, it would automatically
inherit the "syntacticFunction" feature from the Constituent type.

So... is the Token a constituent or not?

If it is, then it should probably inherit from Constituent.

If it is not, then we should probably change our parser wrappers so that an additional
terminal constituent is introduced in the constituency tree which can bear the syntactic
function that would otherwise be associated directly with the token.
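
The two alternatives could look roughly like this in plain Java. All class names here are stand-ins invented for the sketch (`TokenA`, `TokenB`, `TerminalConstituent` do not exist in DKPro Core); the point is only where the syntactic function ends up:

```java
/** Stand-in classes contrasting the two options; none of these are real DKPro Core types. */
final class TokenAsConstituentSketch {
    static class Annotation {
        final int begin;
        final int end;

        Annotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    static class Constituent extends Annotation {
        String syntacticFunction;
        Constituent parent;

        Constituent(int begin, int end) {
            super(begin, end);
        }
    }

    /** Option A: the token is a constituent and inherits "syntacticFunction". */
    static class TokenA extends Constituent {
        TokenA(int begin, int end) {
            super(begin, end);
        }
    }

    /** Option B: the token stays a plain annotation ... */
    static class TokenB extends Annotation {
        TokenB(int begin, int end) {
            super(begin, end);
        }
    }

    /** ... and an extra terminal constituent carries the function and points at the token. */
    static class TerminalConstituent extends Constituent {
        final TokenB token;

        TerminalConstituent(TokenB token) {
            super(token.begin, token.end);
            this.token = token;
        }
    }

    public static void main(String[] args) {
        TokenA a = new TokenA(4, 9);
        a.syntacticFunction = "DA"; // stored on the token itself

        TerminalConstituent terminal = new TerminalConstituent(new TokenB(4, 9));
        terminal.syntacticFunction = "DA"; // stored one level above the token

        System.out.println(a.syntacticFunction + " " + terminal.syntacticFunction);
    }
}
```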

Original issue reported on code.google.com by richard.eckart on 2013-09-04 14:17:34

reckart commented 9 years ago
Nice summary.

In my world (TM), a Token is not a constituent.
So I would vote for introducing an additional terminal constituent and linking the
token to that if necessary.

Original issue reported on code.google.com by torsten.zesch on 2013-09-04 14:21:40

reckart commented 9 years ago
That would entail that we also remove the "parent" feature from the Token.

I slightly tend to adopt the view that a token is a part of the constituency tree (a
terminal node). The reason being this: 

If we create a kind of "pre-terminal" node in the constituency tree, what type/label
would that have? Looking at how some parsers are implemented, the "pre-terminal" node
bears the part-of-speech tag, while the terminal (the Token) is just the text. In
DKPro Core, however, the part-of-speech is attached to the Token (yeah... my fault,
I know, but - as you too have noticed - very convenient). So the "pre-terminal" node
would either duplicate the POS tag information (not good imho) or just be an empty
dummy (likewise not so nice). 

I also think that removing "parent" from the token and introducing a pre-terminal may
require more extensive changes to the code than deriving Token from Constituent.

Original issue reported on code.google.com by richard.eckart on 2013-09-04 14:27:02

reckart commented 9 years ago
Interesting, I hadn't even noticed the getParent() method so far.

It probably depends whether you have a parser-centric view or not.
In the other perspective, where tokens are created by a segmentation process, it makes
little sense to define a "parent" of a token.

I am a bit worried that with making token a constituent, we adopt this parser-centric
view which might have "interesting" consequences later.

Original issue reported on code.google.com by torsten.zesch on 2013-09-04 14:35:35

reckart commented 9 years ago
Fair point. However, changing the structure would cause very real, significant breakage
of existing code and data, whereas changing the inheritance might break little or nothing.
That's just intuition at the moment; a test would be required. If I am correct,
I'd prefer to break little or nothing now rather than break a lot now to avoid problems
we may or may not run into later... unless we have a clear picture of what these problems
would be and how we assess them.

Original issue reported on code.google.com by richard.eckart on 2013-09-04 16:02:54

reckart commented 9 years ago
As I am not an ontologist, I am fine with not breaking things.

Original issue reported on code.google.com by torsten.zesch on 2013-09-04 16:43:43

reckart commented 9 years ago
(No text was entered with this change)

Original issue reported on code.google.com by richard.eckart on 2013-12-19 01:58:07

reckart commented 7 years ago

Added a "syntacticFunction" feature to the token, since it also has the "parent" feature and this is the path of least resistance (i.e. nothing is likely to break).
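
As a rough picture of the resulting shape (plain Java stand-ins, not the generated JCas classes; field names follow the feature names mentioned in this thread, everything else is assumed):

```java
/** Stand-in for the token after the change; not the generated DKPro Core JCas class. */
final class TokenFeatureSketch {
    static class Annotation {
        final int begin;
        final int end;

        Annotation(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    static class Token extends Annotation {
        Annotation parent;        // existing feature, left untouched
        String syntacticFunction; // added feature; stays null if the parser provides none

        Token(int begin, int end) {
            super(begin, end);
        }
    }

    public static void main(String[] args) {
        Token token = new Token(4, 9);
        token.syntacticFunction = "DA";
        System.out.println(token.syntacticFunction);
    }
}
```

Code that only reads "parent" keeps working unchanged, since the new feature is purely additive.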