explosion / spacy-stanza

💥 Use the latest Stanza (StanfordNLP) research models directly in spaCy

Takes too long to parse doc results #18

Open Joselinejamy opened 5 years ago

Joselinejamy commented 5 years ago

Hello, it takes too long to parse the doc object, i.e. to iterate over the sentences and the tokens in them. Is that expected?

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

for line in lines:
    doc = nlp.pipe([line])

The above code takes a few milliseconds (apart from initialisation) to run over 500 sentences,

snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

for line in lines:
    doc = nlp.pipe([line])
    token_details = []

    for sents in doc:
        for tok in sents:
            token_details.append([tok.text, tok.lemma_, tok.pos_])

while this takes almost a minute (apart from initialisation) to run over 500 sentences.

P.S.: I have put nlp.pipe() inside a for loop intentionally, to get all the tokens for one sentence even if it gets segmented.

honnibal commented 5 years ago

@Joselinejamy nlp.pipe() is a generator, so you're not actually executing the parser in the first block. I think that's why it seems faster: it's not actually doing the work. To make sure the parse is completed, you'll need something like:


snlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=model_dir)
nlp = StanfordNLPLanguage(snlp)

word_count = 0
for doc in nlp.pipe(lines):
    word_count += len(doc)
print(word_count)

The main efficiency problem we have at the moment is that we don't have support for batching the predictions and returning a Doc object per item. We'd gladly accept a PR for this.
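To see where the time goes, note that the parsing only happens once the generator is consumed. A minimal sketch (reusing nlp and lines from above):

import time

docs = nlp.pipe(lines)  # returns a generator immediately; nothing is parsed yet
start = time.time()
docs = list(docs)  # consuming the generator is what actually runs the models
print("Parse time: %.2fs" % (time.time() - start))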

Joselinejamy commented 5 years ago

@honnibal Thank you for the instant response. But when I ran the code below with just spaCy's model, it took relatively little time, around just 5 seconds.

import time
import spacy

start = time.time()
spacy_nlp = spacy.load('en')
for line in lines:
    doc = spacy_nlp.pipe([line])
    token_details = []

    for sent in doc:
        for tok in sent:
            token_details.append([tok.text, tok.lemma_, tok.pos_])

print("Time taken : %f " % (time.time() - start))

As per the documentation,

If language data for the given language is available in spaCy, the respective language class will be used as the base for the nlp object – for example, English()

So when the same English object is used, why is it taking so much time? Or does my understanding diverge from what is intended?

diegollarrull commented 4 years ago

Hi, I'm also seeing a drastic performance decrease when using stanza. For comparison, here's a project I'm working on where I'm running a number of different parsers on over 6000 sentences. You can see that running CoreNLP 3 + CoreNLP 4 + spaCy takes roughly one eighth of the time of running CoreNLP 3 + CoreNLP 4 + Stanza through spacy_stanza.

[Screenshot: timing comparison of the parser configurations over 6000 sentences]

Could this be GPU-related as well? These tests were run on a CPU, not a GPU.
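For reference, here's a minimal sketch of how to confirm whether stanza sees a GPU (stanza.Pipeline takes a use_gpu flag and falls back to CPU when CUDA is unavailable):

import torch
import stanza

print(torch.cuda.is_available())  # False on a CPU-only machine
# Force CPU explicitly so runs are comparable across machines:
snlp = stanza.Pipeline(lang="en", processors="tokenize,pos", use_gpu=False)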

adrianeboyd commented 4 years ago

The stanza models are just much slower than the typical spacy core models. spacy-stanza is just a wrapper that hooks stanza into the tokenizer part of the spacy pipeline, so it looks like the pipeline components are the same as in a plain English() model, but underneath the tokenizers are different. You can see:

import spacy
import stanza
import spacy_stanza
from spacy_stanza import StanzaLanguage

snlp = stanza.Pipeline(lang="en")
nlp_stanza = StanzaLanguage(snlp)

nlp_spacy = spacy.blank("en") # equivalent to English()

# both are the same type of Language pipeline
assert isinstance(nlp_stanza, spacy.language.Language)
assert isinstance(nlp_spacy, spacy.language.Language)

# both are [] (no pipeline components beyond the tokenizer)
assert nlp_stanza.pipe_names == nlp_spacy.pipe_names

# however the tokenizers are completely different, and the
# spacy_stanza "tokenizer" is doing all the time-consuming stanza processing
assert isinstance(nlp_stanza.tokenizer, spacy_stanza.language.Tokenizer)
assert isinstance(nlp_spacy.tokenizer, spacy.tokenizer.Tokenizer)

And as Matt said above, there's no good batching solution for stanza at the moment, so the speed difference between nlp_spacy.pipe() and the spacy-stanza pipeline is going to be even larger.
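To measure that gap directly, here's a rough sketch (timings will vary by hardware, and lines stands in for any list of input sentences):

import time

def time_pipe(nlp, texts):
    # consume the generator so the work actually runs
    start = time.time()
    n_tokens = sum(len(doc) for doc in nlp.pipe(texts))
    return time.time() - start, n_tokens

for name, nlp in (("spacy", nlp_spacy), ("spacy-stanza", nlp_stanza)):
    elapsed, n_tokens = time_pipe(nlp, lines)
    print("%s: %d tokens in %.2fs" % (name, n_tokens, elapsed))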