emorynlp / nlp4j-old

NLP tools developed by Emory University.

Garbage tokens? #24

Closed frankandrobot closed 8 years ago

frankandrobot commented 8 years ago

I just ran the POS tagger using the code below. Unfortunately, the first token seems to be garbage. Can I always assume that this will be the case?

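    // Load the decoder configuration from configUri
    val config = DecodeConfig(IOUtils.createFileInputStream(configUri))
    val decoder = NLPDecoder(config)
    // Decode a raw sentence into a sequence of nodes
    val tokens = decoder.decode("For god so loved.")

    // Print each node on its own line
    for (p in tokens) println(p)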

Output:

    0   @#r$%   @#r$%   @#r$%   _           _   _   _   @#r$%
    1   For     for     IN      _           _   _   _   @#r$%
    2   god     god     NN      pos2=UH     _   _   _   @#r$%
    3   so      so      RB      _           _   _   _   @#r$%
    4   loved   love    VBD     pos2=VBN    _   _   _   @#r$%
    5   .       .       .       _           _   _   _   @#r$%

onedash commented 8 years ago

It's not an issue. The token with ID=0 is an artificial root node (it's very useful in dependency parsing, for example).

frankandrobot commented 8 years ago

Yes, that answers my question. When I get POS tags, I just need to filter out the first token.
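
For anyone landing here later, a minimal sketch of that filtering step. It does not call nlp4j at all: `Token`, `withoutRoot`, `posTags`, and `decoded` are hypothetical stand-ins that mirror the first four columns of the output above, since the accessor names on the decoder's real node class are not shown in this thread. Filtering on ID 0 rather than on position is an assumption that the root always carries ID 0, as stated above.

```kotlin
// Hypothetical stand-in for the decoder's node type; NOT the nlp4j class.
data class Token(val id: Int, val form: String, val lemma: String, val pos: String)

// Keep only real tokens by dropping the artificial root (ID 0).
fun withoutRoot(tokens: List<Token>): List<Token> = tokens.filter { it.id != 0 }

// Collect the POS tags of the real tokens, in sentence order.
fun posTags(tokens: List<Token>): List<String> = withoutRoot(tokens).map { it.pos }

// Sample data copied from the output in this thread (first four columns only).
val decoded = listOf(
    Token(0, "@#r\$%", "@#r\$%", "@#r\$%"),
    Token(1, "For", "for", "IN"),
    Token(2, "god", "god", "NN"),
    Token(3, "so", "so", "RB"),
    Token(4, "loved", "love", "VBD"),
    Token(5, ".", ".", ".")
)

fun main() {
    // Print form and POS tag for every token except the root.
    for (t in withoutRoot(decoded)) println("${t.form}\t${t.pos}")
}
```

With the real decoder output, the same idea would presumably be to skip the first element of the returned sequence or to test each node's ID, depending on what the node class exposes.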