Open LuoweiZhou opened 6 years ago
https://github.com/stanfordnlp/CoreNLP/issues/668
If you start the server with a properties file that sets tokenize.whitespace = true, it will tokenize exclusively on whitespace. You can then submit your string already tokenized the way you want, with whitespace separating the tokens.
CoreNLP can, but I don't know whether stanford-corenlp can.
You can process pretokenized text by setting tokenize.whitespace and ssplit.eolonly.
props = {
    'annotators': 'tokenize,ssplit,pos',
    'tokenize.whitespace': 'true',
    'ssplit.eolonly': 'true',
    'pipelineLanguage': 'en',
    'outputFormat': 'json'
}
Then separate tokens by whitespace and sentences by newlines:
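# One inner list per sentence, one string per token
text = [['Hello', 'world', '!'], ['How', 'are', 'you', '?']]
input_text = "\n".join([" ".join(sentence) for sentence in text])
# nlp is assumed to be an already-initialized CoreNLP client object
annotation = nlp.annotate(input_text, properties=props)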
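One caveat with the join above: a token that itself contains whitespace (say 'New York') would be split again by tokenize.whitespace, so the output tokens would no longer line up with your input. Here is a minimal sketch of a guard for that. It runs without CoreNLP, and the helper names to_pretokenized_text and from_pretokenized_text are made up for illustration, not part of any library.

```python
def to_pretokenized_text(sentences):
    """Join tokens with spaces and sentences with newlines.

    Hypothetical helper: rejects input that whitespace tokenization
    or end-of-line sentence splitting could not reproduce exactly.
    """
    for sentence in sentences:
        if not sentence:
            raise ValueError("empty sentence")
        for token in sentence:
            # Empty tokens or tokens with whitespace would shift the alignment.
            if not token or any(ch.isspace() for ch in token):
                raise ValueError(f"empty token or token with whitespace: {token!r}")
    return "\n".join(" ".join(sentence) for sentence in sentences)


def from_pretokenized_text(text):
    """Split the joined string back into sentences and tokens (local check only)."""
    return [line.split(" ") for line in text.split("\n")]


text = [['Hello', 'world', '!'], ['How', 'are', 'you', '?']]
input_text = to_pretokenized_text(text)

# The round trip should give back exactly the tokens we started with.
assert from_pretokenized_text(input_text) == text
```

The resulting input_text can be passed to nlp.annotate exactly as in the snippet above. Whether to raise or to replace the inner whitespace with some placeholder is up to you; raising just makes the mismatch visible early.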
(Although I'm not sure anyone uses corenlp anymore 😅)
Does it support POS tagging for pre-tokenized text? As in here: https://nlp.stanford.edu/software/pos-tagger-faq.html#pretagged