dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

PennTreeBank Reader for tagged corpora #439

Closed reckart closed 9 years ago

reckart commented 9 years ago
DKPro does not yet have a reader for the tagged plain-text corpora that ship
with the Penn Treebank (PTB).

Points for discussion:
- The corpora contain noun phrase annotations (in addition to the tags). Is there a type
for annotating noun phrases in DKPro?

- Tokens occasionally have two or more possible part-of-speech tags in cases of ambiguity.
How should those be handled? Take only the first one?

- The Switchboard corpus in the PTB additionally marks wrongly tagged words. How should
those be handled? Is there a 'no-tag' attribute value for the UIMA POS type?
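
To make the first two points concrete, here is a minimal sketch in Python (not DKPro code) of how one line of such a file could be parsed. It assumes the usual layout of the tagged files: whitespace-separated `word/TAG` items, noun phrases wrapped in standalone `[` and `]`, alternative tags joined by `|`, and slashes inside a word escaped as `\/`. The names `Token` and `parse_tagged_line` are made up for illustration, and the Switchboard error marks are not handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    text: str
    tags: List[str]  # all candidate tags, in the order they appear in the file
    in_np: bool      # True if the token sits inside a [ ... ] noun phrase

    @property
    def tag(self) -> str:
        # "Take only the first one" policy for ambiguous tokens.
        return self.tags[0]


def parse_tagged_line(line: str) -> List[Token]:
    """Parse one line of (assumed) PTB tagged text into tokens."""
    tokens: List[Token] = []
    in_np = False
    for item in line.split():
        if item == "[":
            in_np = True
        elif item == "]":
            in_np = False
        else:
            # Split at the last slash so escaped slashes stay with the word.
            text, sep, tag_field = item.rpartition("/")
            if not sep:
                continue  # not a word/TAG item, e.g. a separator line
            tokens.append(Token(text.replace("\\/", "/"), tag_field.split("|"), in_np))
    return tokens


tokens = parse_tagged_line("[ The/DT plant/NN ] is/VBZ operating/VBG|JJ ./.")
```

Keeping every candidate tag on the token and exposing the first one through `tag` leaves the "first tag only" decision reversible. Whether the bracketed spans should map onto an existing DKPro chunk type is still the open question raised above.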

Original issue reported on code.google.com by Tobias.Horsmann on 2014-08-01 11:12:50