explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

pyinstaller package size for spacy has significantly increased #4548

Closed erotavlas closed 4 years ago

erotavlas commented 4 years ago

I noticed when creating a pyinstaller executable from my spacy python scripts that the resulting package is very large, mostly due to the inclusion of something called 'mkl' (whose DLLs take up over 500MB of the package). When I created my original pyinstaller package using spacy version 2.1.4, those were not present, although there may be additional packages contributing to the file size increase.

EDIT: I found that a folder called 'torch' takes up the most space (over 50 %) and is being included although I don't see any package relating to pytorch installed in the environment

This was definitely not present before when I was first using pyinstaller to create spacy executables sometime between spacy versions 2.0.16 and 2.1.4

Here is a breakdown of what is taking up the most space in the pyinstaller output folder:

[Image: breakdown of folder sizes in the pyinstaller output folder]

I went from something around 40MB to 1.8GB which is pretty significant.
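For anyone who wants to produce a similar breakdown without a screenshot, here is a minimal sketch that sums file sizes per top-level entry of an output folder. The function names `folder_sizes` and `largest` are made up for illustration, and the `dist/ner` path in the usage note assumes pyinstaller's default one-folder output for a script called `ner.py`.

```python
import os


def folder_sizes(dist_dir):
    """Return {entry_name: total_bytes} for each top-level entry in dist_dir."""
    sizes = {}
    for entry in os.scandir(dist_dir):
        if entry.is_dir(follow_symlinks=False):
            # Walk the sub-folder and add up every file below it.
            total = 0
            for root, _dirs, files in os.walk(entry.path):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
        else:
            # Plain file sitting directly in the output folder.
            total = entry.stat().st_size
        sizes[entry.name] = total
    return sizes


def largest(sizes, n=10):
    """Return the n biggest (name, bytes) pairs, largest first."""
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Calling `largest(folder_sizes("dist/ner"))` should list the heaviest folders first, which is how entries such as 'torch' or the mkl DLLs would stand out.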

I found another report of this, regarding mkl in particular, here: https://github.com/conda-forge/numpy-feedstock/issues/84

I'm using an Anaconda environment with spacy installed using conda install spacy==2.1.8

Below are the packages installed in my environment:

# packages in environment at D:\Anaconda\envs\spacy218_temp2:
#
# Name                    Version                   Build  Channel
asn1crypto                1.2.0                    py36_0    anaconda
attrs                     19.3.0                     py_0    anaconda
blas                      1.0                         mkl    anaconda
ca-certificates           2019.10.16                    0    anaconda
certifi                   2019.9.11                py36_0    anaconda
cffi                      1.12.3           py36h7a1dbc1_0    anaconda
chardet                   3.0.4                 py36_1003    anaconda
cryptography              2.7              py36h7a1dbc1_0    anaconda
cymem                     2.0.2            py36h74a9793_0    anaconda
cython-blis               0.2.4            py36hfa6e2cd_1    conda-forge
icc_rt                    2019.0.0             h0cc432a_1    anaconda
idna                      2.8                      py36_0    anaconda
importlib_metadata        0.23                     py36_0    anaconda
intel-openmp              2019.5                      281    anaconda
jsonschema                3.1.1                    py36_0    anaconda
mkl                       2019.5                      281    anaconda
mkl_fft                   1.0.12           py36h14836fe_0    anaconda
mkl_random                1.0.2            py36h343c172_0    anaconda
more-itertools            7.2.0                    py36_0    anaconda
murmurhash                1.0.2            py36h33f27b4_0    anaconda
numpy                     1.16.4           py36h19fb1c0_0    anaconda
numpy-base                1.16.4           py36hc3f5095_0    anaconda
openssl                   1.1.1                he774522_0    anaconda
pip                       19.3.1                   py36_0    anaconda
plac                      0.9.6                    py36_0    anaconda
preshed                   2.0.1            py36h33f27b4_0    anaconda
pycparser                 2.19                     py36_0    anaconda
pyopenssl                 19.0.0                   py36_0    anaconda
pyrsistent                0.14.11          py36he774522_0    anaconda
pysocks                   1.7.1                    py36_0    anaconda
python                    3.6.9                h5500b2f_0    anaconda
requests                  2.22.0                   py36_0    anaconda
setuptools                41.4.0                   py36_0    anaconda
six                       1.12.0                   py36_0    anaconda
spacy                     2.1.8            py36he980bc4_0    conda-forge
sqlite                    3.29.0               he774522_0    anaconda
srsly                     0.1.0            py36h6538335_0    conda-forge
thinc                     7.0.8            py36he980bc4_0    conda-forge
tqdm                      4.36.1                     py_0    anaconda
urllib3                   1.24.2                   py36_0    anaconda
vc                        14.1                 h21ff451_3    anaconda
vs2015_runtime            15.5.2                        3    anaconda
wasabi                    0.3.0                      py_0    conda-forge
wheel                     0.33.6                   py36_0    anaconda
win_inet_pton             1.1.0                    py36_0    anaconda
wincertstore              0.2              py36h7fe50ca_0    anaconda
zipp                      0.6.0                      py_0    anaconda

My pyinstaller command for 2.1.8 is

pyinstaller ner.py --hidden-import cymem.cymem --hidden-import thinc.linalg --hidden-import murmurhash.mrmr --hidden-import cytoolz.utils --hidden-import cytoolz._signatures --hidden-import spacy.strings --hidden-import spacy.morphology --hidden-import spacy.lexeme --hidden-import spacy.tokens --hidden-import spacy.gold --hidden-import spacy.tokens.underscore --hidden-import spacy.parts_of_speech --hidden-import dill --hidden-import spacy.tokens.printers --hidden-import spacy.tokens._retokenize --hidden-import spacy.syntax --hidden-import spacy.syntax.stateclass --hidden-import spacy.syntax.transition_system --hidden-import spacy.syntax.nonproj --hidden-import spacy.syntax.nn_parser --hidden-import spacy.syntax.arc_eager --hidden-import thinc.extra.search --hidden-import spacy.syntax._beam_utils --hidden-import spacy.syntax.ner --hidden-import thinc.neural._classes.difference --hidden-import spacy.vocab --hidden-import spacy.lemmatizer --hidden-import spacy._ml --hidden-import spacy.lang.en --hidden-import srsly.msgpack.util --hidden-import preshed.maps --hidden-import thinc.neural._aligned_alloc --hidden-import blis --hidden-import blis.py --hidden-import spacy.matcher._schemas --hidden-import spacy._align --hidden-import spacy.syntax._parser_model --hidden-import spacy.kb
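A command this long is easier to maintain when the hidden imports live in a list. The following is only a sketch: `build_pyinstaller_args` is a hypothetical helper, and the list shown is a shortened subset of the modules in the command above, not a complete or verified set.

```python
def build_pyinstaller_args(script, hidden_imports):
    """Build a pyinstaller argument list with one --hidden-import per module."""
    args = [script]
    for module in hidden_imports:
        args += ["--hidden-import", module]
    return args


# Shortened subset of the modules from the command above (illustrative only).
HIDDEN_IMPORTS = [
    "cymem.cymem",
    "thinc.linalg",
    "murmurhash.mrmr",
    "spacy.strings",
    "spacy.lang.en",
    "blis",
]

ARGS = build_pyinstaller_args("ner.py", HIDDEN_IMPORTS)
```

The resulting list can be joined into a command line, or passed to `PyInstaller.__main__.run(ARGS)` if pyinstaller is installed in the same environment.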

EDIT: I am unable to recreate my original package (size of 42MB), I think because every time I try to install an earlier version of spacy (such as 2.1.4) it still installs the latest versions of most other packages, and torch somehow gets included as well.

ines commented 4 years ago

Does PyInstaller look at the modules and guess the dependencies based on the imports? PyTorch isn't a dependency of spaCy itself, but spaCy does ship a few functions that import torch within the function body, so you can use them with PyTorch if you have it installed. There's also a try/except block that imports torch if it's installed for extra functionality. So maybe PyInstaller thinks that spaCy requires torch, even though it doesn't?

erotavlas commented 4 years ago

Yes, if it's listed as an import somewhere, then pyinstaller will automatically bundle it into the package (it scans a script and finds all its imports, the imports within those, and so on).
So I guess that import torch statement explains why it's being included.
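To illustrate why a static scan picks up torch, here is a rough sketch that lists every module imported anywhere in a piece of source, including imports inside a try/except block or a function body. This is not pyinstaller's actual implementation, only an approximation of the idea using the standard `ast` module. `find_imports` and the `EXAMPLE` source are made up for illustration.

```python
import ast


def find_imports(source):
    """Return the sorted top-level module names imported anywhere in source."""
    names = set()
    # ast.walk visits nested nodes too, so imports inside functions
    # and try/except blocks are found just like module-level ones.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports, which have no absolute module name.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return sorted(names)


# Hypothetical source mimicking the optional-import patterns described above.
EXAMPLE = '''
import numpy

try:
    import torch
except ImportError:
    torch = None

def to_tensor(x):
    import torch.nn
    return x
'''

print(find_imports(EXAMPLE))  # ['numpy', 'torch']
```

The scan cannot tell that torch is optional at runtime, so a bundler working this way would include it whenever it can be found in the environment.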

I found there is an option --exclude-module that I can try to ignore torch.
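As a sketch of how that could look, the helpers below check whether a module can be resolved in the current environment (without importing it) and build the matching `--exclude-module` arguments. The names `is_importable` and `exclude_args` are hypothetical. Only the `--exclude-module` flag itself comes from pyinstaller.

```python
import importlib.util


def is_importable(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def exclude_args(modules):
    """Build one --exclude-module argument pair per module name."""
    args = []
    for module in modules:
        args += ["--exclude-module", module]
    return args
```

If `is_importable("torch")` returns True, torch is visible to the interpreter even when it does not show up in the package list, and `exclude_args(["torch"])` gives the extra arguments to append to the pyinstaller command.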

What about the mkl package? Were there any changes regarding this?

erotavlas commented 4 years ago

I rebuilt the package and the exclude worked - without torch the package size reduces to around 500MB which is still large. I think the remaining issue is the presence of the 'mkl' package. Any ideas about that?

EDIT: I created a new environment, but instead of installing spacy through anaconda I used pip. First I installed numpy via pip, then I installed spacy version 2.1.8 using pip. After running pyinstaller, my package size was down to just over 70 MB and the mkl packages are no longer there.

The anaconda install of spacy seems to pull in mkl instead of openblas.

ines commented 4 years ago

Glad it worked – thanks for updating!

And yes, numpy installed via Anaconda seems to default to MKL instead of OpenBLAS. I think you can configure that during installation. Or alternatively, installing from pip works as well.
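One way to check which BLAS a given numpy build uses is to capture the output of `numpy.show_config()` and look for "mkl" in it. This is a rough sketch: the helper names are made up, and it assumes that an MKL-linked build mentions "mkl" somewhere in that output, which may vary between numpy versions.

```python
import contextlib
import io


def numpy_build_info():
    """Return numpy's build configuration as text, or None if numpy is missing."""
    try:
        import numpy
    except ImportError:
        return None
    buffer = io.StringIO()
    # show_config() prints to stdout, so capture it instead of returning it.
    with contextlib.redirect_stdout(buffer):
        numpy.show_config()
    return buffer.getvalue()


def mentions_mkl(text):
    """Assumption: an MKL-linked build mentions 'mkl' in its config output."""
    return "mkl" in text.lower()
```

Running `mentions_mkl(numpy_build_info() or "")` in both the conda and the pip environment should show whether the two numpy builds differ in this respect.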

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.