flairNLP / flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)
https://flairnlp.github.io/flair/

tagger.predict() is very slow #7

Closed: juggernauts closed this issue 6 years ago

juggernauts commented 6 years ago

Compared to other deep learning based NER models, tagger.predict() appears to be slow. It took around 70 seconds to parse a string with 455 tokens.

Upon running a line profiler, it seems all the time is spent creating the embeddings:

self.embeddings.embed(sentences)

Any ideas why this would be so slow?
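
For reference, a minimal sketch of how such a measurement might be reproduced, assuming the line_profiler package and the same model loading shown later in this thread (none of this setup is part of the original report):

from line_profiler import LineProfiler
from flair.data import Sentence
from flair.tagging_model import SequenceTagger

tagger = SequenceTagger.load('ner')
sentence = Sentence("My long text of 455 words")

# profile predict() line by line to see where the time goes
profiler = LineProfiler()
profiled_predict = profiler(tagger.predict)
profiled_predict(sentence)
profiler.print_stats()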

alanakbik commented 6 years ago

Hello juggernauts, thanks for posting this!

Could you give me some details on how you are passing the string to the parser? Are you using sentence splitting and passing a list of sentences? Or are you putting it all into one Sentence object (which would be an extremely long sentence at 455 words)?

Could you perhaps post your entire code so I can reproduce?

alanakbik commented 6 years ago

The latest commit now includes automatic batching for parsing lists of sentences. Can you split your text into a list of sentences and try again?

juggernauts commented 6 years ago

You were right, I was passing the complete text without splitting it into sentences. With your latest commit I was able to bring the time down to 18 seconds, so it did work. Here's my latest code:

from flair.tagging_model import SequenceTagger
from flair.data import Sentence
import nltk

# split the long text into individual sentences
sent_tokens = nltk.sent_tokenize("My long text of 455 words")
sentences = [Sentence(s) for s in sent_tokens]

# load the pre-trained NER tagger and predict over the whole list at once
tagger = SequenceTagger.load('ner')
tagger.predict(sentences)

alanakbik commented 6 years ago

Ok, that's great! Thanks for raising the issue and posting the code! We also expect upcoming releases to further improve tagging speed; we'll keep you posted!

getshaun24 commented 5 years ago

Will this method of splitting sentences work with classification as well? When classifying tweets, would splitting them into sentences fracture the overall meaning of the tweet?

alanakbik commented 5 years ago

For full-text classification you should not use sentence splitting; each tweet (or text paragraph you wish to classify) should get its own Sentence object. Since tweets are not long, it should be OK runtime-wise.
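
For illustration, a minimal sketch of that approach, using the same 'en-sentiment' classifier that appears further down in this thread (the tweet text itself is made up): each tweet is wrapped in a single Sentence object and classified whole, with no sentence splitting.

from flair.data import Sentence
from flair.models import TextClassifier

# load the classifier once
classifier = TextClassifier.load('en-sentiment')

# one Sentence object per tweet, no sentence splitting
tweet = Sentence('just watched the new episode and it was amazing')
classifier.predict(tweet)

# the predicted label is attached to the Sentence
print(tweet.labels)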

getshaun24 commented 5 years ago

Thanks Alan!

It is taking me about 7 seconds to predict a single tweet! This is way too long when predicting over a large set.

Right now I am looping through an array of tweets and then predicting on each one.

Any suggestions on speeding this up?

I appreciate all your help :)

alanakbik commented 5 years ago

Hi @jewl123, yes that seems very slow. You are using a non-GPU setup, correct?

You can try to use mini-batching to speed things up; for instance, you could pass a list of 4, 8, 16 or 32 tweets at the same time, like this:

from flair.data import Sentence
from flair.models import TextClassifier

# load the classifier once
classifier = TextClassifier.load('en-sentiment')

# make mini-batch of sentences
sentences = [
    Sentence('I love this movie'),
    Sentence('I hate this movie'),
    Sentence('This movie is great'),
]

# pass mini-batch
classifier.predict(sentences)

# each sentence now carries its predicted label
for sentence in sentences:
    print(sentence)

getshaun24 commented 5 years ago

Thanks for the quick and concise responses, Alan!

Yes, it's on a CPU, but when I use it on the cloud it does not seem much faster.

I will implement this now and let you know.

getshaun24 commented 5 years ago

Ahhhh I just realized that I had the "TextClassifier.load_from_file" call inside my for loop! Moving it outside helped tremendously.

I will also try with mini-batches soon to see if it optimizes prediction time and report back.

Hope this helps someone later on.
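
For anyone finding this later, a minimal sketch combining both fixes from this thread (load the model once outside the loop, then predict in mini-batches); the tweet list, batch size and loop structure below are illustrative, not taken from the thread:

from flair.data import Sentence
from flair.models import TextClassifier

# load the model exactly once, outside any loop
classifier = TextClassifier.load('en-sentiment')

tweets = ['first example tweet', 'second example tweet', 'third example tweet']

# one Sentence object per tweet; no sentence splitting for classification
sentences = [Sentence(t) for t in tweets]

# predict in mini-batches instead of one tweet at a time
batch_size = 32
for start in range(0, len(sentences), batch_size):
    classifier.predict(sentences[start:start + batch_size])

# each Sentence now carries its predicted label
for sentence in sentences:
    print(sentence.labels)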