Joselinejamy opened this issue 5 years ago
@Joselinejamy `nlp.pipe()` is a generator, so you're not actually executing the parser in the first block. I think that's why it seems faster: it's not actually doing the work. To make sure the parse is completed, you'll need something like:
```python
import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

word_count = 0
for doc in nlp.pipe(lines):
    word_count += len(doc)
print(word_count)
```
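The laziness of a generator can be seen in plain Python (a minimal standalone sketch, unrelated to spaCy itself):

```python
# A generator does no work until it is consumed, which is why
# calling nlp.pipe() alone appears to be fast.
def parse(texts):
    for t in texts:
        print("parsing", t)  # the expensive work would happen here
        yield t.upper()

docs = parse(["a", "b"])  # nothing printed yet: no parsing has happened
results = list(docs)      # only now is each item actually processed
```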
The main efficiency problem we have at the moment is that we don't have support for batching the predictions and returning a `Doc` object per item. We'd gladly accept a PR for this.
@honnibal Thank you for the instant response. But when I ran the code below with just spaCy's model, it took relatively little time, around just 5 seconds.
```python
import time
import spacy

start = time.time()
spacy_nlp = spacy.load('en')
for line in lines:
    doc = spacy_nlp.pipe([line])
    token_details = []
    for sent in doc:
        for tok in sent:
            token_details.append([tok.text, tok.lemma_, tok.pos_])
print("Time taken : %f " % (time.time() - start))
```
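(For comparison, a single `pipe()` call over the whole list lets spaCy batch internally instead of paying per-call overhead for each line; a minimal sketch using a blank English pipeline so no model download is needed, with `lines` as placeholder data:)

```python
import spacy

lines = ["This is a sentence.", "Here is another one."]  # placeholder data

nlp = spacy.blank("en")  # tokenizer-only pipeline, no model download needed
token_details = []
# One pipe() call over the whole list lets spaCy process texts in batches.
for doc in nlp.pipe(lines):
    for tok in doc:
        token_details.append([tok.text, tok.lemma_, tok.pos_])
```

With a blank pipeline the `lemma_` and `pos_` fields stay empty, since there is no tagger or lemmatizer; with a loaded model they would be filled in as in the snippet above.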
As per the documentation,

> If language data for the given language is available in spaCy, the respective language class will be used as the base for the nlp object – for example, English()

So when the same `English` object is used, why is it taking so much time? Or is my understanding different from what is intended?
Hi, I'm also seeing a drastic performance decrease when using stanza. For a comparison, here's a project I'm working on, where I'm running a different number of parsers on over 6000 sentences. It can be seen that running CoreNLP 3 + CoreNLP 4 + spaCy takes roughly 8 times less time than running CoreNLP 3 + CoreNLP 4 + Stanza through spacy_stanza.
Could this be GPU-related as well? These tests were run on a CPU, not a GPU.
The `stanza` models are just much slower than the typical `spacy` core models. `spacy-stanza` is just a wrapper that hooks `stanza` into the tokenizer part of the spacy pipeline, so it looks like the pipeline components are the same as in a plain `English()` model, but underneath the tokenizers are different. You can see:
```python
import spacy
import stanza
import spacy_stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="en")
nlp_stanza = StanzaLanguage(snlp)
nlp_spacy = spacy.blank("en")  # equivalent to English()

# both are the same type of Language pipeline
assert isinstance(nlp_stanza, spacy.language.Language)
assert isinstance(nlp_spacy, spacy.language.Language)

# both [] (no components beyond a tokenizer)
assert nlp_stanza.pipe_names == nlp_spacy.pipe_names

# however the tokenizers are completely different, and the
# spacy_stanza "tokenizer" is doing all the time-consuming stanza processing
assert isinstance(nlp_stanza.tokenizer, spacy_stanza.language.Tokenizer)
assert isinstance(nlp_spacy.tokenizer, spacy.tokenizer.Tokenizer)
```
And as Matt said above, there's no good batching solution for `stanza` at the moment, so the speed difference between `nlp_spacy.pipe()` and the `spacy-stanza` pipeline is going to be even higher.
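To quantify the gap on your own hardware, a small throughput harness can help (a sketch; `words_per_second` is a hypothetical helper, shown here with a blank spaCy pipeline so it runs without model downloads):

```python
import time
import spacy

def words_per_second(nlp, texts):
    """Rough throughput measurement for any spaCy Language pipeline."""
    start = time.perf_counter()
    n_words = sum(len(doc) for doc in nlp.pipe(texts))
    elapsed = time.perf_counter() - start
    return n_words / elapsed

texts = ["This is a test sentence."] * 200  # placeholder corpus
nlp_spacy = spacy.blank("en")
print(f"blank spaCy: {words_per_second(nlp_spacy, texts):.0f} words/s")
```

Pointing the same helper at the `nlp_stanza` pipeline from the snippet above would show a much lower rate, since the stanza models do all their processing inside the tokenizer.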
Hello, it takes too long to parse the doc object, i.e. to iterate over the sentences and the tokens in them. Is that expected?
The above code takes a few milliseconds (apart from initialisation) to run over 500 sentences,
while this takes almost a minute (apart from initialisation) to run over 500 sentences.
P.S.: I have put `nlp.pipe()` inside a for loop intentionally to get all tokens for one sentence even though it gets segmented.