Lynten / stanford-corenlp

Python wrapper for Stanford CoreNLP.
MIT License

POS tagging for pre-tokenized text #57

Open LuoweiZhou opened 6 years ago

LuoweiZhou commented 6 years ago

Does it support POS tagging for pre-tokenized text? As in here: https://nlp.stanford.edu/software/pos-tagger-faq.html#pretagged

sherlockhoatszx commented 5 years ago

https://github.com/stanfordnlp/CoreNLP/issues/668

If you start the server with a properties file and set tokenize.whitespace = true, it will tokenize on whitespace exclusively. So you can submit your string pre-tokenized the way you want, with whitespace separating the tokens.

CoreNLP itself can, but I don't know whether stanford-corenlp (this wrapper) supports it.
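For the server-side route mentioned above, a minimal sketch of starting the CoreNLP server with a properties file might look like this (the filename server.properties and port are assumptions; paths depend on your CoreNLP install):

# server.properties (assumed filename) -- make the pipeline respect
# pre-tokenized input: split tokens on whitespace, sentences on newlines
#   tokenize.whitespace = true
#   ssplit.eolonly = true

# Launch the server from the CoreNLP distribution directory,
# pointing it at the properties file above:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -serverProperties server.properties -port 9000

With the server running, any client (including this wrapper) that sends whitespace-tokenized, newline-separated text should get tags back on exactly those tokens.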

ianporada commented 3 weeks ago

You can process pretokenized text by setting tokenize.whitespace and ssplit.eolonly.

props = {
    'annotators': 'tokenize,ssplit,pos',
    'tokenize.whitespace': 'true',
    'ssplit.eolonly': 'true',
    'pipelineLanguage': 'en',
    'outputFormat': 'json'
}

Then separate tokens with whitespace and sentences with newlines:

text = [['Hello', 'world', '!'], ['How', 'are', 'you', '?']] 
input_text = "\n".join([" ".join(sentence) for sentence in text])
annotation = nlp.annotate(input_text, properties=props)
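With outputFormat set to json, the annotation comes back as CoreNLP's JSON structure (some wrapper versions return it as a raw string rather than a parsed dict). A small helper like the following can pull out the (word, POS) pairs; the sample response below is hand-written to illustrate the structure, and its tags are illustrative only:

```python
import json

def extract_pos(annotation):
    """Return a list of (word, POS) pairs per sentence from a
    CoreNLP JSON response (accepts a JSON string or a parsed dict)."""
    if isinstance(annotation, str):
        annotation = json.loads(annotation)
    return [
        [(tok["word"], tok["pos"]) for tok in sentence["tokens"]]
        for sentence in annotation["sentences"]
    ]

# Minimal hand-written response in the CoreNLP JSON shape (tags illustrative):
sample = {
    "sentences": [
        {"tokens": [{"word": "Hello", "pos": "UH"},
                    {"word": "world", "pos": "NN"},
                    {"word": "!", "pos": "."}]},
    ]
}
print(extract_pos(sample))
```

Because the input was pre-tokenized, the tokens in the response line up one-to-one with the tokens you submitted.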

(Although I'm not sure anyone uses corenlp anymore 😅)