By default turn off production of POS tags for all non-pos-taggers and non-readers

GoogleCodeExporter commented 9 years ago

When I build a pipeline with e.g. OpenNlpTagger and StanfordParser, then by 
default the StanfordParser will also add POS tags, and I end up with two sets 
of POS annotations in the CAS, with the ones from the StanfordParser (which 
runs later) being attached to the Token.

I think as a user I would expect that the parser would not add another set of 
POS tags and, ideally, that it would use the POS tags already present in 
the CAS from the previous POS tagger. So I suggest making these changes to the 
defaults of all non-pos-tagger/non-reader components:

PARAM_READ_POS: true
PARAM_WRITE_POS: false

However, mind that not all parsers support using pre-existing POS tags, e.g. 
BerkeleyParser afaik doesn't support that, while other parsers actually require 
pre-existing POS tags. So the changes above would mainly affect the following 
components:

BerkeleyParser: no longer produce POS tags by default
OpenNlpParser: no longer produce POS tags by default
StanfordParser: no longer produce POS tags by default, read in pre-existing POS 
tags by default.
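
The effect of the proposed default can be illustrated with a minimal sketch. This is a toy Python model, not DKPro Core or UIMA code: `ToyCas`, `toy_tagger`, `toy_parser` and `run_pipeline` are hypothetical names, and the `write_pos` argument only stands in for PARAM_WRITE_POS.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCas:
    """Toy stand-in for a CAS: tokens plus a flat list of POS annotations."""
    tokens: list
    pos: list = field(default_factory=list)  # entries: (token_index, producer)


def toy_tagger(cas):
    # A dedicated POS tagger always writes one tag per token.
    cas.pos.extend((i, "tagger") for i in range(len(cas.tokens)))


def toy_parser(cas, write_pos=False):
    # Hypothetical parser; write_pos=False mirrors the proposed
    # PARAM_WRITE_POS default. Parsing itself is omitted.
    if write_pos:
        cas.pos.extend((i, "parser") for i in range(len(cas.tokens)))


def run_pipeline(write_pos):
    cas = ToyCas(tokens=["Time", "flies"])
    toy_tagger(cas)
    toy_parser(cas, write_pos=write_pos)
    return cas


current = run_pipeline(write_pos=True)    # parser writes a second set of tags
proposed = run_pipeline(write_pos=False)  # only the tagger's tags remain
print(len(current.pos), len(proposed.pos))  # prints: 4 2
```

With the parser writing tags, every token carries two POS annotations; with the proposed default, only the tagger's annotations are left.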

What do you think?

Original issue reported on code.google.com by richard.eckart on 3 Jul 2014 at 6:44

GoogleCodeExporter commented 9 years ago

>> I think as a user I would expect that the parser would not add another set 
>> of POS tags

I agree. Without being fully familiar with the original (non-wrapped) parsers, 
I would expect exactly that.

>> However, mind that not all parsers support using pre-existing POS tags, e.g. 
>> BerkeleyParser afaik doesn't support that

What does "support using pre-existing POS tags" mean?
I guess it means whether or not a parser uses pre-existing POS tags at all.

I think the most important question for a user is whether a parser _requires_ 
pre-existing POS tags or not.

A user with a strong linguistic background might also be interested to know 
whether a parser _is able_ to produce POS tags. Then the user can choose which 
component to use for the annotation of POS tags: a POS tagger or the parser 
(might depend on the task). How can a user be informed of that capability?

>> BerkeleyParser: no longer produce POS tags by default
I do not get this. You wrote that the BerkeleyParser doesn't support 
pre-existing POS tags?

Original comment by eckle.kohler on 3 Jul 2014 at 7:39

GoogleCodeExporter commented 9 years ago

I just thought about what I wrote and am not sure whether it hits the point or 
really makes sense.

In principle, the parsers that produce POS tags do not behave as expected in a 
pipeline world where the annotations added by components are neatly assigned to 
different analysis levels. And that is what is currently done: the list of 
components available in Core is aligned with the different analysis levels.

So it would be important for a user to know whether a particular component 
spans several analysis levels (as e.g. the Stanford parser does).

Original comment by eckle.kohler on 3 Jul 2014 at 7:51

GoogleCodeExporter commented 9 years ago

>> However, mind that not all parsers support using pre-existing POS tags, e.g. 
>> BerkeleyParser afaik doesn't support that

> What does "support using pre-existing POS tags" mean?
> I guess it means whether or not a parser uses pre-existing POS tags at all.

It means whether a parser *can* use pre-existing POS tags. E.g. the Stanford 
parser can be configured to operate on pre-existing POS tags and to build its 
parse trees on them. It can also be configured to ignore pre-existing POS tags 
and to generate them as part of the parsing process.

>> BerkeleyParser: no longer produce POS tags by default
> I do not get this. You wrote that the BerkeleyParser doesn't support 
> pre-existing POS tags?

BerkeleyParser, however, afaik does not allow the use of pre-existing POS tags 
and will always generate them as part of the parsing process. But we can 
configure it not to write these POS tags to the CAS and instead leave the 
pre-existing POS tags from a POS tagger in there. This might result in 
situations where the constituency tree and the POS tags are not properly in 
sync, e.g. a noun phrase might consist of a single verb token because the POS 
tagger assigned the tag "verb" to the token while the parser thought it was a 
"noun" and built its constituency structure accordingly.
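
Such a mismatch can be detected by comparing the tagger's tags with the preterminal labels of the parse tree. The following is a minimal sketch in plain Python, assuming a hypothetical nested-tuple tree representation `(label, child, ...)`; `preterminals` and `pos_mismatches` are illustrative helpers, not part of any parser's API.

```python
def preterminals(tree):
    """Yield (tag, word) pairs from a nested-tuple constituency tree."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        # A preterminal node: its single child is the word itself.
        yield label, children[0]
    else:
        for child in children:
            yield from preterminals(child)


def pos_mismatches(tagger_tags, tree):
    """Return (index, word, tagger_tag, parser_tag) wherever the two disagree."""
    return [
        (i, word, tagged, parsed)
        for i, (tagged, (parsed, word)) in enumerate(
            zip(tagger_tags, preterminals(tree))
        )
        if tagged != parsed
    ]


# The parser treated "Time" as a noun and built an NP around it ...
tree = ("S", ("NP", ("NN", "Time")), ("VP", ("VBZ", "flies")))
# ... while the (hypothetical) POS tagger labelled it as a verb.
tagger_tags = ["VB", "VBZ"]

print(pos_mismatches(tagger_tags, tree))  # prints: [(0, 'Time', 'VB', 'NN')]
```

If only the tagger's tags are kept in the CAS, the NP in this example ends up covering a token tagged as a verb.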

> I think the most important question for a user is whether a parser 
> _requires_ pre-existing POS tags or not.

As far as I can see, most dependency parsers require pre-existing POS tags, 
whereas constituency parsers usually do not.

> A user with a strong linguistic background might also be interested to know 
> whether a parser _is able_ to produce POS tags. Then the user can choose which 
> component to use for the annotation of POS tags: a POS tagger or the parser 
> (might depend on the task). How can a user be informed of that capability?

The user might notice that there is a parameter PARAM_WRITE_POS on the 
component which can be set to "true".

Original comment by richard.eckart on 3 Jul 2014 at 7:57

GoogleCodeExporter commented 9 years ago

It would be sensible that, if a component has the options PARAM_READ_POS and 
PARAM_WRITE_POS, PARAM_WRITE_POS is automatically disabled when 
PARAM_READ_POS is enabled. E.g. a parser that consumes POS tags from a 
POS tagger should not add them a second time just because they happen to be 
integrated into the parse trees generated by the parser.
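
The rule could be sketched roughly as follows. This is a hypothetical helper in plain Python, not the actual component code; `resolve_pos_params` and its arguments are assumptions that only mirror the two parameter names.

```python
def resolve_pos_params(read_pos=True, write_pos=False):
    """Return the effective (read_pos, write_pos) pair.

    The keyword defaults mirror the proposed PARAM_READ_POS=true and
    PARAM_WRITE_POS=false; reading existing tags always disables writing.
    """
    if read_pos:
        write_pos = False
    return read_pos, write_pos


print(resolve_pos_params())                                # (True, False)
print(resolve_pos_params(read_pos=True, write_pos=True))   # (True, False)
print(resolve_pos_params(read_pos=False, write_pos=True))  # (False, True)
```

A parser that cannot read POS tags at all would simply always be in the `read_pos=False` case, so its write flag would be honoured as set.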

Original comment by richard.eckart on 17 Aug 2014 at 4:05

Changed title: By default turn off production of POS tags for all non-pos-taggers and non-readers

GoogleCodeExporter commented 9 years ago

Issue 444 has been merged into this issue.

Original comment by richard.eckart on 22 Jan 2015 at 10:55

aminorex / dkpro-core-asl

By default turn off production of POS tags for all non-pos-taggers and non-readers #416