Closed thomasthiebaud closed 1 year ago
It seems that Stanford NLP has a tokenize_pretokenized option (https://stanfordnlp.github.io/stanfordnlp/pipeline.html#running-on-pre-tokenized-text). I'll see if I can use that.
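For anyone landing here later, a rough sketch of how that option might be used. The helper names (to_pretokenized_text, parse_pretokenized) are hypothetical, and the processor list and the word attributes read at the end are my assumptions about the stanfordnlp API, so please check them against the linked docs. With tokenize_pretokenized=True the pipeline is documented to treat whitespace as token boundaries and newlines as sentence boundaries, so sentences split elsewhere (for example by spaCy's sentencizer) only need to be serialized into that shape:

```python
from typing import List, Tuple


def to_pretokenized_text(sentences: List[List[str]]) -> str:
    """Serialize already-split sentences into the pre-tokenized format:
    one sentence per line, tokens separated by single spaces."""
    lines = []
    for tokens in sentences:
        for tok in tokens:
            # A token containing whitespace would be split again downstream.
            if not tok or any(ch.isspace() for ch in tok):
                raise ValueError(f"invalid token for pre-tokenized input: {tok!r}")
        lines.append(" ".join(tokens))
    return "\n".join(lines)


def parse_pretokenized(sentences: List[List[str]], lang: str = "en") -> List[List[Tuple]]:
    """Tag and parse pre-tokenized sentences (hypothetical helper).

    Assumes stanfordnlp and the models for `lang` are installed; the import
    is local so the serializer above works without the library.
    """
    import stanfordnlp

    nlp = stanfordnlp.Pipeline(
        lang=lang,
        processors="tokenize,pos,lemma,depparse",  # assumed processor list
        tokenize_pretokenized=True,  # skip the tokenizer model
    )
    doc = nlp(to_pretokenized_text(sentences))
    # Attribute names below are assumed from the stanfordnlp Word class.
    return [
        [(w.text, w.upos, w.governor, w.dependency_relation) for w in sent.words]
        for sent in doc.sentences
    ]


if __name__ == "__main__":
    sents = [["This", "is", "a", "test", "."], ["Another", "one", "."]]
    print(to_pretokenized_text(sents))
```

Only the serializer runs without stanfordnlp installed; parse_pretokenized needs the library and downloaded models.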
Just going through some older issues, and it sounds like you found a solution. But please feel free to reopen if you're still running into issues!
Right now spacy-stanfordnlp is taking care of the tokenization too. Would it be possible to use spaCy's sentencizer and keep stanfordnlp just for tagging and parsing?
I can only think of running two pipelines: the first one that only uses sentencizer, and the second one that uses stanfordnlp.Pipeline. That would mean tokenizing twice, and probably a performance penalty.
I'm going through the docs and looking at the source code but can't find any proper way to do it.