explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy
MIT License

Use sentencizer with stanfordnlp #24

Closed thomasthiebaud closed 11 months ago

thomasthiebaud commented 4 years ago

Right now spacy-stanfordnlp takes care of the tokenization too. Would it be possible to use spaCy's sentencizer and keep stanfordnlp just for tagging and parsing?

The only approach I can think of is running two pipelines: the first uses only the sentencizer and the second uses stanfordnlp.Pipeline. That means the text gets tokenized twice, probably with a performance penalty.

I've been going through the docs and the source code, but I can't find a proper way to do it.

thomasthiebaud commented 4 years ago

It seems that StanfordNLP has a tokenize_pretokenized option, described at https://stanfordnlp.github.io/stanfordnlp/pipeline.html#running-on-pre-tokenized-text (I'll see if I can use that).
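
For anyone landing here later, a rough sketch of how that option might be used. According to the linked docs, with tokenize_pretokenized=True the pipeline accepts either a list of token lists (one per sentence) or a string with tokens separated by spaces and sentences separated by newlines. The helper names below (clean_sentences, to_pretokenized_text, parse_pretokenized) are made up for illustration and are not part of spacy-stanfordnlp or stanfordnlp. It also assumes the English models are already downloaded, and it has not been checked against every stanfordnlp version.

```python
from typing import Iterable, List


def clean_sentences(sentences: Iterable[Iterable[str]]) -> List[List[str]]:
    """Drop whitespace-only tokens and empty sentences.

    Whitespace tokens would be ambiguous in the space/newline-delimited
    string format, so they are filtered out up front.
    """
    cleaned = []
    for sent in sentences:
        tokens = [tok for tok in sent if tok.strip()]
        if tokens:
            cleaned.append(tokens)
    return cleaned


def to_pretokenized_text(sentences: Iterable[Iterable[str]]) -> str:
    """Build the string form: tokens joined by spaces, sentences by newlines."""
    return "\n".join(" ".join(tokens) for tokens in clean_sentences(sentences))


def parse_pretokenized(sentences: Iterable[Iterable[str]], lang: str = "en"):
    """Hypothetical wrapper: run stanfordnlp on already-tokenized sentences.

    Imported lazily so the helpers above work without stanfordnlp installed.
    Assumes the models were fetched beforehand, e.g. stanfordnlp.download("en").
    """
    import stanfordnlp

    nlp = stanfordnlp.Pipeline(lang=lang, tokenize_pretokenized=True)
    # The list-of-lists form avoids any re-splitting on whitespace.
    return nlp(clean_sentences(sentences))
```

With a spaCy Doc that already has sentence boundaries set by the sentencizer, the input could be built as `[[t.text for t in sent] for sent in doc.sents]`. Note that this only covers handing the tokens to stanfordnlp; copying the resulting tags and dependencies back onto the spaCy Doc is a separate step not shown here.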

adrianeboyd commented 11 months ago

Just going through some older issues, and it sounds like you found a solution. But please feel free to reopen if you're still running into issues!