dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

By default turn off production of POS tags for all non-pos-taggers and non-readers #416

Closed reckart closed 8 years ago

reckart commented 9 years ago

When I build a pipeline with, e.g., OpenNlpTagger and StanfordParser, the StanfordParser also adds POS tags by default. I end up with two sets of POS annotations in the CAS, and the set from the StanfordParser, which runs later, is the one attached to the Token.
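The situation described above can be illustrated with a toy model. This is only a sketch in plain Java, not the UIMA or DKPro Core API: `ToyCas`, `writePos`, and `DuplicatePosDemo` are hypothetical names, and the tags "NN" and "VB" are made-up values for a single token.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a CAS that holds a single token. Hypothetical, not the UIMA API.
class ToyCas {
    // Every POS annotation ever added, recorded as "component:tag".
    final List<String> posAnnotations = new ArrayList<>();
    // The POS annotation currently attached to the token.
    String tokenPos;

    // Mimics a component that adds a POS annotation and attaches it to the token.
    void writePos(String component, String tag) {
        String annotation = component + ":" + tag;
        posAnnotations.add(annotation);
        tokenPos = annotation; // the component that runs last wins
    }
}

class DuplicatePosDemo {
    // Runs a tagger followed by a parser that may or may not write its own POS tags.
    static ToyCas run(boolean parserWritesPos) {
        ToyCas cas = new ToyCas();
        cas.writePos("tagger", "NN");
        if (parserWritesPos) {
            cas.writePos("parser", "VB");
        }
        return cas;
    }

    public static void main(String[] args) {
        ToyCas cas = run(true);
        System.out.println(cas.posAnnotations.size()
                + " POS annotations, token points to " + cas.tokenPos);
    }
}
```

With `parserWritesPos` set to `false`, only the tagger's annotation exists and stays attached to the token, which is the behavior proposed as the default.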

I think as a user I would expect that the parser would not add another set of POS tags and, ideally, that it would use the POS tags already present in the CAS from the previous POS tagger. So I suggest making these changes to the defaults of all non-pos-tagger/non-reader components:

However, mind that not all parsers support using pre-existing POS tags, e.g. BerkeleyParser afaik doesn't support that, while other parsers actually require pre-existing POS tags. So the changes above would mainly affect the following components:

What do you think?

Original issue reported on code.google.com by richard.eckart on 2014-07-03 06:44:39

reckart commented 9 years ago
>> I think as a user I would expect that the parser would not add another set of POS
tags

I agree - without being fully familiar with the original (non-wrapped) parsers, I would
expect exactly that

>> However, mind that not all parsers support using pre-existing POS tags, e.g. BerkeleyParser
afaik doesn't support that

what does "support using pre-existing POS tags" mean?
I guess it means whether or not a parser uses pre-existing POS tags at all.

I think the most important question for a user is whether a parser _requires_ pre-existing
POS tags or not.

A user with a strong linguistic background might also be interested to know whether a parser
_is able_ to produce POS tags. Then the user can choose which component to use for
the annotation of POS tags: a POS tagger or the parser (might depend on the task) -
how can a user be informed of that capability?

>> BerkeleyParser: no longer produce POS tags by default
I do not get this - you wrote the BerkeleyParser doesn't support pre-existing POS tags?

Original issue reported on code.google.com by eckle.kohler on 2014-07-03 07:39:57

reckart commented 9 years ago
just thought about what I wrote and am not sure if it hits the point / really makes
sense:

in principle, the parsers that produce POS tags do not behave as expected in a pipeline
world where the annotations added by components are nicely assigned to different levels
- and that's what is currently done:
the list of components available in Core is aligned with the different analysis levels

so it would be important for a user to know whether a particular component spans several
analysis levels (as e.g. the Stanford parser does)

Original issue reported on code.google.com by eckle.kohler on 2014-07-03 07:51:37

reckart commented 9 years ago
>> However, mind that not all parsers support using pre-existing POS tags, e.g. BerkeleyParser
afaik doesn't support that

>what does "support using pre-existing POS tags" mean?
>I guess it means whether or not a parser uses pre-existing POS tags at all.

It means whether a parser *can* use pre-existing POS tags. E.g. the Stanford parser
can be configured to operate on pre-existing POS tags and to build its parse trees
on them. It can also be configured to ignore pre-existing POS tags and then generate
them as part of the parsing process.

>> BerkeleyParser: no longer produce POS tags by default
> I do not get this - you wrote the BerkeleyParser doesn't support pre-existing POS
tags?

BerkeleyParser, however, afaik does not allow using pre-existing POS tags and will
always generate them as part of the parsing process. But we can configure it not to
write these POS tags to the CAS and instead leave the pre-existing POS tags from a POS tagger
in there. This might result in situations where the constituency tree and the POS tags
are not properly in sync, e.g. a noun phrase might consist of a single verb token because
the POS tagger assigned the tag "verb" to the token while the parser thought it was
a "noun" and built its constituency structure accordingly.
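To make the out-of-sync case concrete, here is a minimal sketch of how such disagreements could be spotted. It assumes the parser's internal POS tags are somehow available for comparison, which the thread does not confirm for BerkeleyParser. `PosSyncCheck` and `mismatches` are hypothetical names, and the tag arrays stand in for per-token tags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of DKPro Core.
class PosSyncCheck {
    // Returns the indices of tokens where the tagger's POS tag differs from
    // the POS tag the parser assumed while building its tree.
    static List<Integer> mismatches(String[] taggerPos, String[] parserPos) {
        if (taggerPos.length != parserPos.length) {
            throw new IllegalArgumentException("Both arrays must cover the same tokens");
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < taggerPos.length; i++) {
            if (!taggerPos[i].equals(parserPos[i])) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up tags for two tokens: the tagger says "verb" for the second
        // token, the parser treated it as "noun" inside a noun phrase.
        String[] tagger = {"DT", "VB"};
        String[] parser = {"DT", "NN"};
        System.out.println("Mismatching token indices: " + mismatches(tagger, parser));
    }
}
```

Any index reported here marks a token whose CAS POS tag no longer matches the constituent built over it.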

>I think the most important question for a user is whether a parser _requires_ pre-existing
POS tags or not.

As far as I see, most dependency parsers require pre-existing POS tags, whereas constituency
parsers usually do not.

>A user with a strong linguistic background might also be interested to know whether a
parser _is able_ to produce POS tags. Then the user can choose which component to use
for the annotation of POS tags: a POS tagger or the parser (might depend on the task)
- how can a user be informed of that capability?

The user might notice that there is a parameter PARAM_WRITE_POS on the component which
can be set to "true".

Original issue reported on code.google.com by richard.eckart on 2014-07-03 07:57:17

reckart commented 9 years ago
It seems sensible that if a component has both the options PARAM_READ_POS and PARAM_WRITE_POS,
then PARAM_WRITE_POS should be automatically disabled when PARAM_READ_POS is enabled.
E.g. a parser that consumes POS tags from a POS tagger should not add them a second
time just because they happen to be integrated into the parse trees generated by the
parser.
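A minimal sketch of this rule, read literally: writing is suppressed whenever reading is enabled, and otherwise the requested value is used. `PosDefaults` and `effectiveWritePos` are hypothetical names. How the actual components resolve the two parameters is not shown in this thread.

```java
// Hypothetical helper illustrating the proposed rule, not DKPro Core code.
class PosDefaults {
    // readPos:           value of PARAM_READ_POS
    // requestedWritePos: value of PARAM_WRITE_POS as configured
    // Returns whether the component should actually write POS tags to the CAS.
    static boolean effectiveWritePos(boolean readPos, boolean requestedWritePos) {
        // A component that consumes existing POS tags never writes them again.
        return !readPos && requestedWritePos;
    }

    public static void main(String[] args) {
        System.out.println("read=true,  write=true  -> " + effectiveWritePos(true, true));
        System.out.println("read=false, write=true  -> " + effectiveWritePos(false, true));
        System.out.println("read=false, write=false -> " + effectiveWritePos(false, false));
    }
}
```

One design question this leaves open is whether an explicit PARAM_WRITE_POS=true should be allowed to override the rule. The sketch assumes it should not.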

Original issue reported on code.google.com by richard.eckart on 2014-08-17 16:05:46

reckart commented 9 years ago
Issue 444 has been merged into this issue.

Original issue reported on code.google.com by richard.eckart on 2015-01-22 22:55:18

reckart commented 8 years ago

Changed OpenNlpParser to not create POS tags by default.

Leaving SfstAnnotator to create POS tags by default. Not sure if this is a good idea though... let's see.