The parser and POS tagger components already use the PTB3 escaper to sanitize tokens
such as ( or ) before they reach the parser. The StanfordPosTagger does not do this yet.
Adding it should largely be a copy-paste from the StanfordParser component. However, the
named entities are currently detected over the document string, not over the tokens. We
would need to change this so that detection operates on the tokens, which is only
possible if offsets are preserved by the Stanford classifier code.
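
To illustrate why offsets matter, here is a minimal sketch of token-level bracket escaping that keeps the original character offsets. It is not the DKPro Core or Stanford code: `PtbEscapeSketch`, `Token`, `escape` and `escapeAll` are hypothetical names, and only the standard Penn Treebank bracket replacements are covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PtbEscapeSketch {
    /** Hypothetical token holder: surface text plus character offsets into the document. */
    static final class Token {
        final String text;
        final int begin;
        final int end;

        Token(String text, int begin, int end) {
            this.text = text;
            this.begin = begin;
            this.end = end;
        }
    }

    // Penn Treebank style bracket escapes; other PTB3 normalizations are left out here.
    private static final Map<String, String> PTB_BRACKETS = Map.of(
            "(", "-LRB-", ")", "-RRB-",
            "[", "-LSB-", "]", "-RSB-",
            "{", "-LCB-", "}", "-RCB-");

    /** Returns the escaped form of a bracket token, or the text unchanged. */
    static String escape(String tokenText) {
        return PTB_BRACKETS.getOrDefault(tokenText, tokenText);
    }

    /** Escapes each token's text while keeping its original begin/end offsets. */
    static List<Token> escapeAll(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            // Only the text changes; begin/end still point into the original document.
            out.add(new Token(escape(t.text), t.begin, t.end));
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "See (here)";
        List<Token> tokens = List.of(
                new Token("See", 0, 3), new Token("(", 4, 5),
                new Token("here", 5, 9), new Token(")", 9, 10));
        for (Token t : escapeAll(tokens)) {
            // Show the escaped text next to the original span it maps back to.
            System.out.println(t.text + " <- \"" + doc.substring(t.begin, t.end) + "\"");
        }
    }
}
```

Because the escaped text (e.g. -LRB-) is longer than the original character, any result produced on the escaped tokens can only be mapped back to the document if the begin/end offsets travel along with each token.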
Original issue reported on code.google.com by richard.eckart on 2014-05-16 09:45:19