You definitely don't want to be loading your model during the handling of a request. Load it during your app's setup and have your request-handling code access it through a global etc.
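For example, here's a minimal sketch assuming a Flask app (the framework choice, route name and payload shape are just illustrative):

import spacy
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model once at startup, not inside a request handler.
nlp = spacy.load('en_core_web_md')

@app.route('/tokens', methods=['POST'])
def tokens():
    # Every request reuses the module-level model.
    doc = nlp(request.json['text'])
    return jsonify([token.text for token in doc])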
Thanks @ned2. The framework creates additional workers at run time, and they are all new processes which do not share code/memory.
A slightly off-topic question here -- en_core_web_sm takes very little time to load, but I really need the vectors from _md. Is it possible to extract those from _md, add them to _sm, and create a new model? And how do you anticipate the load time for this custom model?
Thanks!
You can tell gunicorn to load the application into memory before forking worker processes with the --preload option, which could be helpful for your situation.
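As a sketch, the same setting can also live in a gunicorn config file (the file name, module path and worker count here are just examples):

# gunicorn.conf.py -- run with: gunicorn -c gunicorn.conf.py myapp:app
# preload_app loads the app (and the spaCy model) once in the master
# process before forking, so workers share that memory copy-on-write.
preload_app = True
workers = 4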
A slightly off-topic question here -- en_core_web_sm takes very little time to load, but I really need the vectors from _md. Is it possible to extract those from _md and add them to _sm and create a new model?
The vectors are most likely the reason for the longer loading time, so not sure if this will help. But saving a model without all the other pipeline components could definitely make the processing faster.
You could, for instance, disable all pipeline components and then save out the model:
import spacy

nlp = spacy.load('en_core_web_md')
with nlp.disable_pipes('tagger', 'parser', 'ner'):
    # Saving inside the block writes the model out without the disabled pipes.
    nlp.to_disk('/path/to/new_model')
In your code, you could then load the model from a directory:
nlp = spacy.load('/path/to/new_model')
You could also use the spacy package command (python -m spacy package /path/to/new_model /output/dir) to turn it into a Python package, whichever works best for your setup.
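If you do want to experiment with reusing the _md vectors, a rough, untested sketch might look like the following (the paths are placeholders, and it assumes the vector table can simply be reassigned on the vocab):

import spacy

nlp_md = spacy.load('en_core_web_md')
nlp_sm = spacy.load('en_core_web_sm')

# Point the sm pipeline's vocab at the md vector table.
nlp_sm.vocab.vectors = nlp_md.vocab.vectors

nlp_sm.to_disk('/path/to/sm_with_md_vectors')

That said, since the vectors themselves are likely what dominates load time, the resulting model would probably load about as slowly as _md.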
@ned2 I used preloading and moved all the NLP work to dedicated workers, so the spaCy code stays out of the request-response cycle, and I can use a bigger dyno just for the spaCy-specific worker. So this kind of works. (Celery still forks worker processes at runtime, so the first request to a Celery task takes a while, but that's fine!) Thanks for your input.
@ines Will try that and get back with the results. Really love the work you guys are doing at spaCy!
Thanks! 😃 I'm closing this issue since the main question has been answered – there's probably still some optimisation that can be done, but a lot of this will come down to the specifics of managing the workers and doing the Heroku setup, which is all pretty independent of spaCy.