ContinuumIO / anaconda-issues

Anaconda issue tracking

Package and environment size #8242

Open kalefranz opened 6 years ago

kalefranz commented 6 years ago

From @mrocklin on January 19, 2018 21:25

I sometimes want to put conda environments in docker containers. Unfortunately, installing the classic scientific Python stack results in docker images that are easily 1-3GB in size, which can be problematic when moving them around.

Example environment

Here is an example environment that I care about today:

conda create -n test-defaults  cytoolz dask distributed fastparquet git ipywidgets jupyterlab matplotlib nb_conda_kernels netcdf4 nomkl numba numpy pandas python-blosc scipy xarray zict

This takes up 1.3GB of space.

mrocklin@carbon:~/Software/anaconda/envs$ du -hs test-defaults/
1.3G    test-defaults/
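
(One partial mitigation on the docker side, independent of package sizes: clear conda's package cache in the same RUN layer as the install, so the downloaded tarballs and extracted packages never land in the image. A rough sketch; this trims the cache, not the environment itself:)

RUN conda create -n app --yes numpy scipy pandas \
 && conda clean --all --yes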

I'm curious what all parts of the community can do to reduce this amount, including both package builders and library authors.

Example offenders, pandas and scipy

Some of the biggest offenders include scipy and pandas.

mrocklin@carbon:~/Software/anaconda/envs$ du -hs test-defaults/lib/python3.6/site-packages/{scipy,pandas}
62M test-defaults/lib/python3.6/site-packages/scipy
53M test-defaults/lib/python3.6/site-packages/pandas
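
(The full ranking can be reproduced with something along these lines:)

du -sh test-defaults/lib/python3.6/site-packages/* | sort -h | tail -n 10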

This is worse on conda-forge

mrocklin@carbon:~/Software/anaconda/envs$ du -hs test-conda-forge/lib/python3.6/site-packages/{scipy,pandas}
148M    test-conda-forge/lib/python3.6/site-packages/scipy
109M    test-conda-forge/lib/python3.6/site-packages/pandas

And I suspect that this does not fully capture the problem. Anecdotally, I've noticed that leaving scipy out of an environment and then installing it afterwards results in a 400MB increase, so presumably a number of dependency packages get pulled in along with it.

pip

Pip packages seem to be smaller, but are still unpleasantly large. An environment similar to the one above consumes around 800MB on my machine.

What's going on?

Is all of this compiled code? If so, is it expected to produce hundreds of megabytes of binary artifacts? Even given the code complexity of these projects, it surprises me that they cannot be expressed more concisely than hundreds of megabytes. Are we including some dependencies multiple times? Are we accidentally shipping test data around? What is going on? How can we help reduce the size, in bytes, of the scipy stack?
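
At least the test-data question is easy to measure directly; something like the following should total up the test directories shipped in an environment (the name pattern is a guess and may miss some layouts):

find test-defaults/lib -type d \( -name tests -o -name test \) -exec du -sh {} + | sort -h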

Copied from original issue: conda/conda#6756

kalefranz commented 6 years ago

From @seibert on January 19, 2018 21:31

Probably the largest single space consumer in a conda environment is MKL, which is linked to NumPy in Anaconda.

On linux-64, the unpacked package sizes have been slowly growing over time:

408M    mkl-11.3.3-0
459M    mkl-2017.0.1-0
465M    mkl-2017.0.3-0
648M    mkl-2017.0.4-h4c4d0af_0
648M    mkl-2018.0.0-hb491cac_4
699M    mkl-2018.0.1-h19d6760_4

The large size is because MKL ships with many implementations for different Intel architectures. Unfortunately, I don't think MKL supports stripping out unneeded architectures.

kalefranz commented 6 years ago

From @seibert on January 19, 2018 21:33

@kalefranz, @msarahan: The jump between the old conda-build and the new conda-build version of MKL looks suspicious. Did something in the packaging of MKL change when updating the recipes?

kalefranz commented 6 years ago

From @mrocklin on January 19, 2018 21:35

In the example above I explicitly avoided mkl. However, perhaps other libraries are large for similar reasons?

kalefranz commented 6 years ago

From @seibert on January 19, 2018 21:35

Oh, sorry, I missed the nomkl in the install command. MKL is uniquely strange because it is engineered for per-architecture dispatch. I don't think any other packages have this problem.

kalefranz commented 6 years ago

From @mingwandroid on January 19, 2018 21:38

The binaries are not stripped:

ls -l ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
-rwxrwxr-x 2 root root 3058248 Nov  7 18:15 ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
strip ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
ls -l ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
-rwxrwxr-x 2 root root 1708976 Jan 19 21:37 ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
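
To repeat that across every extension module in an environment (on a copy; strip modifies files in place), something like this should work:

find test-defaults -type f -name '*.so' -exec strip --strip-unneeded {} + 2>/dev/null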

kalefranz commented 6 years ago

From @mingwandroid on January 19, 2018 21:45

But stripping them all only saves about 6MB. It seems the shared libraries are chock full of instruction code:

size cython_special.cpython-36m-x86_64-linux-gnu.so
   text    data     bss     dec     hex filename
3616927   75408    6264 3698599  386fa7 cython_special.cpython-36m-x86_64-linux-gnu.so

kalefranz commented 6 years ago

From @mingwandroid on January 19, 2018 21:48

These are Cython-generated binaries at the end of the day; there are some tips on reducing their size here: https://gist.github.com/tito/9414743

kalefranz commented 6 years ago

From @mingwandroid on January 19, 2018 22:13

Looking into one of them and sorting the 10 largest functions by size:

nm -SD --size-sort cython_special.cpython-36m-x86_64-linux-gnu.so | tail -n 10
00000000002471b0 0000000000003346 T zbknu_
00000000002bb970 0000000000003733 T ciknb_
000000000032adb0 00000000000037d4 T cikva_
0000000000327470 0000000000003934 T cikvb_
00000000002b7e10 0000000000003b55 T cjynb_
00000000002e8560 0000000000003d4a T cjyvb_
000000000030ba00 00000000000041e3 T cjyna_
0000000000303920 0000000000006042 T hygfz_
00000000002df8a0 0000000000008cb8 T cjyva_
0000000000105b70 000000000012e6ac T PyInit_cython_special

PyInit_cython_special is huge; more than 1MB.

kalefranz commented 6 years ago

From @mingwandroid on January 19, 2018 22:24

Disassembling this function:

 497:   48 c7 84 24 f0 02 00    movq   $0x0,0x2f0(%rsp)
 49e:   00 00 00 00 00
 4a3:   48 c7 44 24 40 00 00    movq   $0x0,0x40(%rsp)
 4aa:   00 00
 4ac:   48 c7 84 24 f8 02 00    movq   $0x0,0x2f8(%rsp)
 4b3:   00 00 00 00 00
 4b8:   48 c7 84 24 98 00 00    movq   $0x0,0x98(%rsp)
 4bf:   00 00 00 00 00
 4c4:   48 c7 84 24 00 03 00    movq   $0x0,0x300(%rsp)
 4cb:   00 00 00 00 00
 4d0:   48 c7 84 24 90 00 00    movq   $0x0,0x90(%rsp)
 4d7:   00 00 00 00 00
 4dc:   48 c7 84 24 08 03 00    movq   $0x0,0x308(%rsp)
 4e3:   00 00 00 00 00
 4e8:   48 c7 44 24 38 00 00    movq   $0x0,0x38(%rsp)
 4ef:   00 00
 4f1:   48 c7 84 24 10 03 00    movq   $0x0,0x310(%rsp)
 4f8:   00 00 00 00 00
 4fd:   48 c7 44 24 30 00 00    movq   $0x0,0x30(%rsp)
 504:   00 00
 506:   48 c7 84 24 18 03 00    movq   $0x0,0x318(%rsp)
 50d:   00 00 00 00 00
 512:   48 c7 44 24 28 00 00    movq   $0x0,0x28(%rsp)
 519:   00 00
 51b:   48 c7 84 24 20 03 00    movq   $0x0,0x320(%rsp)
 522:   00 00 00 00 00
 527:   48 c7 44 24 20 00 00    movq   $0x0,0x20(%rsp)
 52e:   00 00
 530:   48 c7 84 24 28 03 00    movq   $0x0,0x328(%rsp)
 537:   00 00 00 00 00
 53c:   48 c7 44 24 18 00 00    movq   $0x0,0x18(%rsp)
 543:   00 00
 545:   48 c7 84 24 30 03 00    movq   $0x0,0x330(%rsp)
 54c:   00 00 00 00 00
 551:   48 c7 44 24 10 00 00    movq   $0x0,0x10(%rsp)
 558:   00 00
 55a:   48 c7 84 24 38 03 00    movq   $0x0,0x338(%rsp)
 561:   00 00 00 00 00
 566:   48 c7 44 24 08 00 00    movq   $0x0,0x8(%rsp)
 56d:   00 00
 56f:   48 c7 84 24 40 03 00    movq   $0x0,0x340(%rsp)
 576:   00 00 00 00 00
 57b:   48 c7 84 24 48 03 00    movq   $0x0,0x348(%rsp)
 582:   00 00 00 00 00
 587:   48 c7 84 24 50 03 00    movq   $0x0,0x350(%rsp)
 58e:   00 00 00 00 00
 593:   48 c7 84 24 58 03 00    movq   $0x0,0x358(%rsp)
 59a:   00 00 00 00 00
 59f:   48 c7 84 24 60 03 00    movq   $0x0,0x360(%rsp)
 5a6:   00 00 00 00 00
 5ab:   48 c7 84 24 68 03 00    movq   $0x0,0x368(%rsp)
 5b2:   00 00 00 00 00
 5b7:   48 c7 04 24 00 00 00    movq   $0x0,(%rsp)
 5be:   00
 5bf:   48 c7 84 24 70 03 00    movq   $0x0,0x370(%rsp)
 5c6:   00 00 00 00 00
 5cb:   48 c7 84 24 88 00 00    movq   $0x0,0x88(%rsp)
 5d2:   00 00 00 00 00
 5d7:   48 c7 84 24 78 03 00    movq   $0x0,0x378(%rsp)
 5de:   00 00 00 00 00
 5e3:   48 c7 84 24 80 00 00    movq   $0x0,0x80(%rsp)
 5ea:   00 00 00 00 00
 5ef:   48 c7 84 24 80 03 00    movq   $0x0,0x380(%rsp)
 5f6:   00 00 00 00 00
 5fb:   48 c7 44 24 78 00 00    movq   $0x0,0x78(%rsp)
 602:   00 00
 604:   48 c7 84 24 88 03 00    movq   $0x0,0x388(%rsp)
 60b:   00 00 00 00 00
 610:   48 c7 44 24 70 00 00    movq   $0x0,0x70(%rsp)
 617:   00 00
 619:   48 c7 84 24 90 03 00    movq   $0x0,0x390(%rsp)
 620:   00 00 00 00 00
 625:   48 c7 44 24 68 00 00    movq   $0x0,0x68(%rsp)
 62c:   00 00
 62e:   48 c7 84 24 98 03 00    movq   $0x0,0x398(%rsp)
 635:   00 00 00 00 00
 63a:   48 c7 44 24 60 00 00    movq   $0x0,0x60(%rsp)
 641:   00 00
 643:   48 c7 84 24 a0 03 00    movq   $0x0,0x3a0(%rsp)
 64a:   00 00 00 00 00
 64f:   48 c7 44 24 58 00 00    movq   $0x0,0x58(%rsp)
 656:   00 00
 658:   48 c7 84 24 a8 03 00    movq   $0x0,0x3a8(%rsp)
 65f:   00 00 00 00 00
 664:   48 c7 44 24 50 00 00    movq   $0x0,0x50(%rsp)
 66b:   00 00
 66d:   48 c7 84 24 b0 03 00    movq   $0x0,0x3b0(%rsp)
 674:   00 00 00 00 00
 679:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
 680:   4d 85 d2                test   %r10,%r10
 683:   74 20                   je     106215 <PyInit_cython_special+0x6a5>
 685:   49 83 2a 01             subq   $0x1,(%r10)
 689:   75 1a                   jne    106215 <PyInit_cython_special+0x6a5>
 68b:   49 8b 42 08             mov    0x8(%r10),%rax
 68f:   4c 89 84 24 b8 03 00    mov    %r8,0x3b8(%rsp)


...and lots and lots more of this: the function slowly zeroes out structures on the stack, one quadword at a time.
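
(For anyone wanting to reproduce this: a new enough binutils objdump can disassemble a single symbol; the exact flag availability depends on the binutils version, so treat this as a sketch:)

objdump --disassemble=PyInit_cython_special cython_special.cpython-36m-x86_64-linux-gnu.so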

kalefranz commented 6 years ago

From @mrocklin on January 19, 2018 22:28

cc @jreback @tomaugspurger for Pandas things

@mingwandroid any thoughts on other offending packages? I would expect a different story from SciPy. Also cc @stefanv.

kalefranz commented 6 years ago

From @mingwandroid on January 19, 2018 22:34

This is scipy. I would be interested to see if -Os makes a dent in this, and in general why all of these movq instructions are not being coalesced into a memset.
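
A quick way to test the -Os hypothesis would be a from-source build with the flag forced in; a sketch (untested, and it only affects the C/Cython parts, since Fortran flags are a separate knob):

CFLAGS="-Os" pip install --no-cache-dir --no-binary :all: scipy
du -sh "$(python -c 'import os, scipy; print(os.path.dirname(scipy.__file__))')"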

kalefranz commented 6 years ago

From @TomAugspurger on January 19, 2018 22:44

For the sdist, pandas includes a bunch of data files that are only used for testing. I don't think they should be included in source or binary distributions.

Other than that, the largest files are .c files for sdists and .so (or whatever) files for wheels. I haven't looked at the conda package recently.

kalefranz commented 6 years ago

From @mingwandroid on January 19, 2018 22:44

Some pandas details (in site-packages/pandas):

du -ah . | grep -v "/$" | sort -h | tail -n 10
2.2M    ./io
3.6M    ./tests/io/sas/data/DEMO_G.xpt
4.6M    ./tests/io/data
5.1M    ./core
11M ./tests/io/sas
11M ./tests/io/sas/data
15M ./_libs
19M ./tests/io
29M ./tests
52M .

So 29MB of the 52MB is tests. We can split packages up quite easily nowadays (using conda-build 3's new split-packages feature); the test: section can then have a requires entry pointing at the separate test package. A rough sketch follows below.
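
A very rough sketch of what that could look like in meta.yaml (names and globs hypothetical and untested; the main output would still need a way to exclude the tests directory):

outputs:
  - name: pandas
    files:
      - lib/python3.6/site-packages/pandas
  - name: pandas-tests
    files:
      - lib/python3.6/site-packages/pandas/tests
    requirements:
      run:
        - {{ pin_subpackage('pandas', exact=True) }}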

kalefranz commented 6 years ago

This turned into a discussion targeting specific packages, which I wasn't quite anticipating. It's probably not exactly the right place to have it, but after the trouble I had trying to move another long issue today, let's just keep it here.

kalefranz commented 6 years ago

From @mrocklin on January 19, 2018 23:26

@kalefranz my hope is that by looking at a couple of the worst offenders we can identify some common patterns that might apply to many of the packages and that we might turn into best practices.

kalefranz commented 6 years ago

From @mingwandroid on January 19, 2018 23:44

Well conda is a package manager and (apart from noarch: python) doesn't really know about or care about the contents of the packages.

This is more of a conda-build or an anaconda-issues issue.

kalefranz commented 6 years ago

From @mrocklin on January 20, 2018 0:01

Happy to move this conversation anywhere.

kalefranz commented 6 years ago

I think my mover tool got rate-limited after ~30 comments. @mingwandroid Do you want this in anaconda-issues?

kalefranz commented 6 years ago

Without any direct knowledge, I agree with @seibert that the size jump coinciding with the conda-build 3 transition (new hashes, probably new compilers, etc.) looks like something we should understand.

kalefranz commented 6 years ago

From @mingwandroid on January 20, 2018 11:35

conda-build 3 and new compilers have no bearing on the size of MKL, since it is just a repackaged binary. Intel adding stuff to MKL is the only cause here, I reckon (but that's got nothing to do with what @mrocklin is reporting about scipy and pandas, since he's using the nomkl variants).

anaconda-issues is fine by me.

kalefranz commented 6 years ago

From @jjhelmus on January 20, 2018 13:32

mkl 2017.0.4 includes the following libraries that were not in 2017.0.3:

libmkl_ao_worker.so              
libmkl_blacs_intelmpi_ilp64.so   
libmkl_blacs_intelmpi_lp64.so    
libmkl_blacs_openmpi_ilp64.so    
libmkl_blacs_openmpi_lp64.so     
libmkl_blacs_sgimpt_ilp64.so     
libmkl_blacs_sgimpt_lp64.so      
libmkl_cdft_core.so              
libmkl_gf_ilp64.so               
libmkl_gf_lp64.so                
libmkl_gnu_thread.so             
libmkl_pgi_thread.so             
libmkl_scalapack_ilp64.so        
libmkl_scalapack_lp64.so         
libmkl_tbb_thread.so 

libmkl_ao_worker.so and the three thread libraries are all over 20 MB in size.

This accounts for most of the increase in the unpacked size, although a few other libraries also grew by a few MB between 2017.0.3 and 2017.0.4.
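
(A diff like this is straightforward to reproduce from the package tarballs; filenames approximate:)

tar -tjf mkl-2017.0.3-0.tar.bz2 | sort > old.txt
tar -tjf mkl-2017.0.4-h4c4d0af_0.tar.bz2 | sort > new.txt
comm -13 old.txt new.txt    # entries only in the newer package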

kalefranz commented 6 years ago

From @rgommers on January 20, 2018 19:12

Regarding scipy and in general: conda packages being 2.5x larger than wheels seems to be a matter of sub-optimal compile flags in conda-build. Those must be dragged in implicitly, via different compiler/OS selection, because they're not in build recipes and such flags are very hard to change without monkeypatching numpy.distutils. This should be the easiest one to figure out and fix.

Cython is fairly heavily used in both scipy and pandas, and indeed results in quite large binaries. There's a lot of boilerplate per .so, so combining multiple functions in a single .pyx helps. And you have to be very careful with Tempita templating to support multiple dtypes; combinatorial explosion happens quickly. Example: the first implementation of a Cythonized scipy.ndimage.label (a single simple function) was 4 MB; even now it's still 70 kB. There's still a note on the scipy roadmap about needing to investigate how much Cython usage we can tolerate.
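
The per-module overhead is easy to see for yourself: even an empty module compiles to a nontrivial .so. A sketch, assuming Cython is installed:

echo "" > empty.pyx
cythonize -i empty.pyx
ls -l empty.*.so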

Anecdotally I've noticed that avoiding and then installing scipy results in a 400MB increase, so presumably there are a number of dependency packages that were brought on.

This is not the case; scipy only depends on numpy. I suspect you've hit a version selection issue where MKL gets dragged back in (priority issues between defaults and conda-forge with that kind of effect used to be common; not sure if that's still the case).

kalefranz commented 6 years ago

From @mingwandroid on January 21, 2018 0:19

Regarding scipy and in general: conda packages being 2.5x larger than wheels seems to be a matter of sub-optimal compile flags in conda-build

I disagree. conda-build is not responsible for setting compiler flags anymore. We have our own compilers and compiler activation scripts which handle that. I would argue that our flags are carefully considered and not far from some definitions of optimal (they generate fast and secure binaries at least).
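
The flags themselves are easy to inspect, since the compiler packages export them from their activation scripts; roughly (Linux package shown; macOS uses clang_osx-64):

conda create -n flagcheck --yes gcc_linux-64
conda activate flagcheck
echo "$CFLAGS"
echo "$LDFLAGS"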

The Anaconda Distribution packages are significantly smaller than everyone else's here, according to my numbers:

Anaconda Distribution:

(base) bash-4.1# du -hs numpy scipy pandas
19M numpy
62M scipy
52M pandas

Manylinux1 wheels:

du -hs numpy scipy pandas
76M numpy
190M    scipy
111M    pandas

And comparing a shared library from scipy:

Anaconda Distribution:

ls -l ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
-rwxrwxr-x 2 root root 3058248 Nov  7 18:15 ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so

Manylinux1 wheels:

ls -l ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
-rwxr-xr-x 1 root root 5302696 Jan 21 00:15 ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so

kalefranz commented 6 years ago

From @rgommers on January 21, 2018 1:22

Ah, I misread @mrocklin's initial post, sorry about that. The 2.5x factor was between defaults and conda-forge, with the latter being bigger. For wheels, all he says is "Pip packages seem to be smaller, but are still unpleasantly large".

The conda-forge number still seems worrying; that should be optimized.

For wheels, it very much depends on platform: macOS wheels are 15 MB, win64 ones 29 MB, and manylinux1 ones 44 MB (as wheels, zip format, from PyPI).

kalefranz commented 6 years ago

@mrocklin If you've discovered anything new or additional information recently, can you update this issue?

Hammond95 commented 6 years ago

I have a problem that is even stranger: two different environments on the same machine, both with pandas and other scientific packages installed.

The size of the same package differs from one environment to the other (~13MB in one, ~50MB in the other).

The packages have the same build string and appear to be downloaded from the same source, so to me this is really strange.

Example (the same repo for both envs):

https://repo.continuum.io/pkgs/main/osx-64/pandas-0.22.0-py36h0a44026_0.tar.bz2

(screenshots removed)

Do you have any idea?

mingwandroid commented 6 years ago

Please do not use screenshots for textual information. It makes people's jobs unnecessarily difficult and prevents web scrapers from being able to index the text.

mingwandroid commented 6 years ago

Also, you've entered a deeply technical discussion with something completely tangential. Please open a new issue. However, do some investigative work before you do, like drilling down into exactly which files are different. You are the only person capable of doing that, since it's your computer.

Hammond95 commented 6 years ago

Sorry, I removed the screenshots; I thought they could be useful.

Actually, the problem is not only on my machine; it's the same on colleagues' machines that created the environments from the exported YAML files of both envs.

I will keep investigating; do you have any ideas?

mingwandroid commented 6 years ago

Sorry, I removed the screenshots; I thought they could be useful.

The information might be useful, but in text form, and not to the people following this issue.

do you have any ideas

Yes, here are a few:

  1. Stop polluting this issue (which concerns the size of binaries created by Cython).
  2. Drill down into the actual file differences using some comparison tool (maybe write one?). I don't know what OS you use, but there are various tools available on different platforms to make this easier. I like Beyond Compare 4. (A minimal first pass with standard tools is sketched after this list.)
  3. If the results from 2. do not enlighten you as to the exact issue and you still need help, then open a new issue.
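
For a first pass on 2., standard tools may already be enough; a rough sketch:

# per-file sizes, made comparable by using relative paths
( cd /path/to/envA && du -ak . | sort -k2 ) > /tmp/a.txt
( cd /path/to/envB && du -ak . | sort -k2 ) > /tmp/b.txt
diff /tmp/a.txt /tmp/b.txt | less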