kalefranz opened this issue
From @seibert on January 19, 2018 21:31
Probably the largest single space consumer in a conda environment is MKL, which is linked to NumPy in Anaconda.
On linux-64, the unpacked package sizes have been slowly growing over time:
408M mkl-11.3.3-0
459M mkl-2017.0.1-0
465M mkl-2017.0.3-0
648M mkl-2017.0.4-h4c4d0af_0
648M mkl-2018.0.0-hb491cac_4
699M mkl-2018.0.1-h19d6760_4
The large size is because MKL ships with many implementations for different Intel architectures. Unfortunately, I don't think MKL supports stripping out unneeded architectures.
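For reference, one quick way to see which of MKL's shared libraries dominate an environment's footprint (my own sketch, not from the thread; the environment path is whatever env has mkl installed):

```shell
# biggest_mkl_libs ENVROOT: rank the MKL shared libraries under ENVROOT/lib
# by size (du reports block counts here; use `du -ah` for human-readable sizes).
biggest_mkl_libs() {
    du -a "$1/lib" 2>/dev/null | grep libmkl | sort -n | tail -n 10
}
# e.g.: biggest_mkl_libs ~/miniconda3/envs/science
```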
From @seibert on January 19, 2018 21:33
@kalefranz, @msarahan: The jump between the old conda-build and the new conda-build version of MKL looks suspicious. Did something in the packaging of MKL change when updating the recipes?
From @mrocklin on January 19, 2018 21:35
In the example above I explicitly avoided mkl . However perhaps other libraries are large for similar reasons?
From @seibert on January 19, 2018 21:35
Oh, sorry, I missed the nomkl in the install command. MKL is uniquely strange because it is engineered for per-architecture dispatch. I don't think any other packages have this problem.
From @mingwandroid on January 19, 2018 21:38
The binaries are not stripped:
ls -l ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
-rwxrwxr-x 2 root root 3058248 Nov 7 18:15 ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
strip ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
ls -l ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
-rwxrwxr-x 2 root root 1708976 Jan 19 21:37 ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
From @mingwandroid on January 19, 2018 21:45
But stripping them all only saves about 6MB. It seems the shared libraries are chock full of instruction code:
size cython_special.cpython-36m-x86_64-linux-gnu.so
text data bss dec hex filename
3616927 75408 6264 3698599 386fa7 cython_special.cpython-36m-x86_64-linux-gnu.so
From @mingwandroid on January 19, 2018 21:48
These are Cython-generated binaries at the end of the day. Some tips on reducing their size: https://gist.github.com/tito/9414743
From @mingwandroid on January 19, 2018 22:13
Looking into one of them and sorting the 10 largest functions by size:
nm -SD --size-sort cython_special.cpython-36m-x86_64-linux-gnu.so | tail -n 10
00000000002471b0 0000000000003346 T zbknu_
00000000002bb970 0000000000003733 T ciknb_
000000000032adb0 00000000000037d4 T cikva_
0000000000327470 0000000000003934 T cikvb_
00000000002b7e10 0000000000003b55 T cjynb_
00000000002e8560 0000000000003d4a T cjyvb_
000000000030ba00 00000000000041e3 T cjyna_
0000000000303920 0000000000006042 T hygfz_
00000000002df8a0 0000000000008cb8 T cjyva_
0000000000105b70 000000000012e6ac T PyInit_cython_special
PyInit_cython_special is huge; more than 1MB.
From @mingwandroid on January 19, 2018 22:24
Disassembling this function:
497: 48 c7 84 24 f0 02 00 movq $0x0,0x2f0(%rsp)
49e: 00 00 00 00 00
4a3: 48 c7 44 24 40 00 00 movq $0x0,0x40(%rsp)
4aa: 00 00
4ac: 48 c7 84 24 f8 02 00 movq $0x0,0x2f8(%rsp)
4b3: 00 00 00 00 00
4b8: 48 c7 84 24 98 00 00 movq $0x0,0x98(%rsp)
4bf: 00 00 00 00 00
4c4: 48 c7 84 24 00 03 00 movq $0x0,0x300(%rsp)
4cb: 00 00 00 00 00
4d0: 48 c7 84 24 90 00 00 movq $0x0,0x90(%rsp)
4d7: 00 00 00 00 00
4dc: 48 c7 84 24 08 03 00 movq $0x0,0x308(%rsp)
4e3: 00 00 00 00 00
4e8: 48 c7 44 24 38 00 00 movq $0x0,0x38(%rsp)
4ef: 00 00
4f1: 48 c7 84 24 10 03 00 movq $0x0,0x310(%rsp)
4f8: 00 00 00 00 00
4fd: 48 c7 44 24 30 00 00 movq $0x0,0x30(%rsp)
504: 00 00
506: 48 c7 84 24 18 03 00 movq $0x0,0x318(%rsp)
50d: 00 00 00 00 00
512: 48 c7 44 24 28 00 00 movq $0x0,0x28(%rsp)
519: 00 00
51b: 48 c7 84 24 20 03 00 movq $0x0,0x320(%rsp)
522: 00 00 00 00 00
527: 48 c7 44 24 20 00 00 movq $0x0,0x20(%rsp)
52e: 00 00
530: 48 c7 84 24 28 03 00 movq $0x0,0x328(%rsp)
537: 00 00 00 00 00
53c: 48 c7 44 24 18 00 00 movq $0x0,0x18(%rsp)
543: 00 00
545: 48 c7 84 24 30 03 00 movq $0x0,0x330(%rsp)
54c: 00 00 00 00 00
551: 48 c7 44 24 10 00 00 movq $0x0,0x10(%rsp)
558: 00 00
55a: 48 c7 84 24 38 03 00 movq $0x0,0x338(%rsp)
561: 00 00 00 00 00
566: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp)
56d: 00 00
56f: 48 c7 84 24 40 03 00 movq $0x0,0x340(%rsp)
576: 00 00 00 00 00
57b: 48 c7 84 24 48 03 00 movq $0x0,0x348(%rsp)
582: 00 00 00 00 00
587: 48 c7 84 24 50 03 00 movq $0x0,0x350(%rsp)
58e: 00 00 00 00 00
593: 48 c7 84 24 58 03 00 movq $0x0,0x358(%rsp)
59a: 00 00 00 00 00
59f: 48 c7 84 24 60 03 00 movq $0x0,0x360(%rsp)
5a6: 00 00 00 00 00
5ab: 48 c7 84 24 68 03 00 movq $0x0,0x368(%rsp)
5b2: 00 00 00 00 00
5b7: 48 c7 04 24 00 00 00 movq $0x0,(%rsp)
5be: 00
5bf: 48 c7 84 24 70 03 00 movq $0x0,0x370(%rsp)
5c6: 00 00 00 00 00
5cb: 48 c7 84 24 88 00 00 movq $0x0,0x88(%rsp)
5d2: 00 00 00 00 00
5d7: 48 c7 84 24 78 03 00 movq $0x0,0x378(%rsp)
5de: 00 00 00 00 00
5e3: 48 c7 84 24 80 00 00 movq $0x0,0x80(%rsp)
5ea: 00 00 00 00 00
5ef: 48 c7 84 24 80 03 00 movq $0x0,0x380(%rsp)
5f6: 00 00 00 00 00
5fb: 48 c7 44 24 78 00 00 movq $0x0,0x78(%rsp)
602: 00 00
604: 48 c7 84 24 88 03 00 movq $0x0,0x388(%rsp)
60b: 00 00 00 00 00
610: 48 c7 44 24 70 00 00 movq $0x0,0x70(%rsp)
617: 00 00
619: 48 c7 84 24 90 03 00 movq $0x0,0x390(%rsp)
620: 00 00 00 00 00
625: 48 c7 44 24 68 00 00 movq $0x0,0x68(%rsp)
62c: 00 00
62e: 48 c7 84 24 98 03 00 movq $0x0,0x398(%rsp)
635: 00 00 00 00 00
63a: 48 c7 44 24 60 00 00 movq $0x0,0x60(%rsp)
641: 00 00
643: 48 c7 84 24 a0 03 00 movq $0x0,0x3a0(%rsp)
64a: 00 00 00 00 00
64f: 48 c7 44 24 58 00 00 movq $0x0,0x58(%rsp)
656: 00 00
658: 48 c7 84 24 a8 03 00 movq $0x0,0x3a8(%rsp)
65f: 00 00 00 00 00
664: 48 c7 44 24 50 00 00 movq $0x0,0x50(%rsp)
66b: 00 00
66d: 48 c7 84 24 b0 03 00 movq $0x0,0x3b0(%rsp)
674: 00 00 00 00 00
679: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
680: 4d 85 d2 test %r10,%r10
683: 74 20 je 106215 <PyInit_cython_special+0x6a5>
685: 49 83 2a 01 subq $0x1,(%r10)
689: 75 1a jne 106215 <PyInit_cython_special+0x6a5>
68b: 49 8b 42 08 mov 0x8(%r10),%rax
68f: 4c 89 84 24 b8 03 00 mov %r8,0x3b8(%rsp)
.. and lots more of the same: slowly clearing structures on the stack, one quadword at a time.
From @mrocklin on January 19, 2018 22:28
cc @jreback @tomaugspurger for Pandas things
@mingwandroid any thoughts on other offending packages? I would expect a different story from SciPy. Also cc @stefanv
From @mingwandroid on January 19, 2018 22:34
This is scipy. I would be interested to see if -Os makes a dent in this, and in general why all of these movq instructions are not being coalesced into a memset.
From @TomAugspurger on January 19, 2018 22:44
For the sdist, pandas includes a bunch of data files that are only used for testing. I don't think they should be included in source or binary distributions.
Other than that, the largest files are .c files for sdists and .so (or whatever) files for wheels. I haven't looked at the conda package recently.
From @mingwandroid on January 19, 2018 22:44
Some pandas details (in site-packages/pandas):
du -ah . | grep -v "/$" | sort -h | tail -n 10
2.2M ./io
3.6M ./tests/io/sas/data/DEMO_G.xpt
4.6M ./tests/io/data
5.1M ./core
11M ./tests/io/sas
11M ./tests/io/sas/data
15M ./_libs
19M ./tests/io
29M ./tests
52M .
So 29MB of 52MB is tests. We can split packages up quite easily nowadays (using conda-build 3's new split-packages feature); the test: section can then have a requires on the test package.
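For illustration, a conda-build 3 recipe split along those lines might look like the following. This is a sketch only: the package names, paths, and version are hypothetical (not pandas's actual recipe), and the exact outputs/files semantics should be checked against the conda-build documentation.

```yaml
package:
  name: pandas-recipe
  version: 0.22.0

outputs:
  # The runtime package, without the bundled test suite.
  - name: pandas
    files:
      - lib/python3.6/site-packages/pandas
  # The test suite as a separate package pinned to the runtime package.
  - name: pandas-tests
    files:
      - lib/python3.6/site-packages/pandas/tests
    requirements:
      run:
        - {{ pin_subpackage('pandas', exact=True) }}

test:
  requires:
    - pandas-tests
```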
This turned into a discussion targeting specific packages, which I wasn't quite anticipating. Probably not exactly the right place to have it, but after trouble trying to move another long issue today, let's just keep it here.
From @mrocklin on January 19, 2018 23:26
@kalefranz my hope is that by looking at a couple of the worst offenders we can identify some common patterns that might apply to many of the packages and that we might turn into best practices.
From @mingwandroid on January 19, 2018 23:44
Well conda is a package manager and (apart from noarch: python) doesn't really know about or care about the contents of the packages.
This is more of a conda-build or an anaconda-issues issue.
From @mrocklin on January 20, 2018 00:01
Happy to move this conversation anywhere.
I think my mover tool got rate-limited after ~30 comments. @mingwandroid Do you want this in anaconda-issues?
Without any direct knowledge, I agree with @seibert that the transition corresponding to the conda-build 3 hashes (probably new compilers, etc.) looks like something we should understand.
From @mingwandroid on January 20, 2018 11:35
conda-build 3 and new compilers have no bearing on the size of MKL, since it is just a repackaged binary. Intel adding stuff to MKL is the only cause here, I reckon (but that has nothing to do with what @mrocklin is reporting about scipy and pandas, since he's using the nomkl variants).
anaconda-issues is fine by me.
From @jjhelmus on January 20, 2018 13:32
mkl 2017.0.4 includes the following libraries that were not in 2017.0.3:
libmkl_ao_worker.so
libmkl_blacs_intelmpi_ilp64.so
libmkl_blacs_intelmpi_lp64.so
libmkl_blacs_openmpi_ilp64.so
libmkl_blacs_openmpi_lp64.so
libmkl_blacs_sgimpt_ilp64.so
libmkl_blacs_sgimpt_lp64.so
libmkl_cdft_core.so
libmkl_gf_ilp64.so
libmkl_gf_lp64.so
libmkl_gnu_thread.so
libmkl_pgi_thread.so
libmkl_scalapack_ilp64.so
libmkl_scalapack_lp64.so
libmkl_tbb_thread.so
libmkl_ao_worker.so and the three thread libraries are all over 20 MB in size.
This accounts for most of the increase in the unpacked size, although a few other libraries also grew by a few MB between 2017.0.3 and 2017.0.4.
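A library-list diff like the one above can be reproduced by comparing the two unpacked packages (a sketch, assuming each package is extracted to its own directory; the helper name is my own, and it needs bash for process substitution):

```shell
# new_libs OLDDIR NEWDIR: list shared libraries present in NEWDIR/lib but
# not in OLDDIR/lib (e.g. two unpacked mkl packages).
new_libs() {
    comm -13 \
        <(ls "$1"/lib/*.so 2>/dev/null | xargs -rn1 basename | sort) \
        <(ls "$2"/lib/*.so 2>/dev/null | xargs -rn1 basename | sort)
}
# e.g.: new_libs mkl-2017.0.3 mkl-2017.0.4
```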
From @rgommers on January 20, 2018 19:12
Regarding scipy and in general: conda packages being 2.5x larger than wheels seems to be a matter of sub-optimal compile flags in conda-build. Those must be dragged in implicitly, via different compiler/OS selection, because they're not in build recipes, and such flags are very hard to change without monkeypatching numpy.distutils. This should be the easiest one to figure out and fix.
Cython is fairly heavily used in both scipy and pandas, and indeed results in quite large binaries. There's a lot of boilerplate per .so, so combining multiple functions in a single .pyx helps. And you have to be very careful with Tempita templating to support multiple dtypes; combinatorial explosion happens quickly. Example: the first implementation of a Cythonized scipy.ndimage.label (a single simple function) was 4 MB; now it's still 70 kB. There's still a note on the scipy roadmap about needing to investigate how much Cython usage we can tolerate.
> Anecdotally I've noticed that avoiding and then installing scipy results in a 400MB increase, so presumably there are a number of dependency packages that were brought on.

This is not the case; scipy only depends on numpy. I suspect you've experienced a version-selection issue where MKL gets dragged back in (priority issues between defaults and conda-forge with that kind of effect used to be common; not sure if that's still the case).
From @mingwandroid on January 21, 2018 0:19
> Regarding scipy and in general: conda packages being 2.5x larger than wheels seems to be a matter of sub-optimal compile flags in conda-build

I disagree. conda-build is not responsible for setting compiler flags anymore. We have our own compilers and compiler activation scripts which handle that. I would argue that our flags are carefully considered and not far from some definitions of optimal (they generate fast and secure binaries, at least).
The Anaconda Distribution packages are significantly smaller than everyone else's, according to my numbers:
Anaconda Distribution:
(base) bash-4.1# du -hs numpy scipy pandas
19M numpy
62M scipy
52M pandas
Manylinux1 wheels:
du -hs numpy scipy pandas
76M numpy
190M scipy
111M pandas
And comparing a shared library from scipy:
Anaconda Distribution:
ls -l test-defaults/lib/python3.6/site-packages/scipy
-rwxrwxr-x 2 root root 3058248 Nov 7 18:15 ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
Manylinux1 wheels:
ls -l ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
-rwxr-xr-x 1 root root 5302696 Jan 21 00:15 ./special/_ufuncs.cpython-36m-x86_64-linux-gnu.so
From @rgommers on January 21, 2018 1:22
Ah, I misread @mrocklin's initial post, sorry about that. The factor of 2.5x was between defaults and conda-forge, with the latter being bigger. For wheels all he says is "Pip packages seem to be smaller, but are still unpleasantly large".
The conda-forge number still seems worrying; that should be optimized.
For wheels, it very much depends on platform: macOS wheels are 15 MB, win64 ones 29 MB, and manylinux1 ones 44 MB (as wheels, zip format, from PyPI).
@mrocklin If you've discovered anything new or additional information recently, can you update this issue?
I have a problem that is even stranger: two different environments on the same machine, both installing pandas and scientific packages.
The size of pandas differs from one environment to the other (~13MB in one and ~50MB in the other).
The packages have the same build string and appear to be downloaded from the same source, so to me this is really strange.
Example (the same repo for both envs):
https://repo.continuum.io/pkgs/main/osx-64/pandas-0.22.0-py36h0a44026_0.tar.bz2
removed screenshots
Do you have any idea?
Please do not use screenshots for textual information. It makes people's jobs unnecessarily difficult and prevents web scrapers from being able to index the text.
Also you've entered a deeply technical discussion with something completely tangential. Please open a new issue. However do some investigative work before you do, like drilling down into exactly what files are different. You are the only person capable of doing that since it's your computer.
Sorry, I removed the screenshots; I thought they could be useful.
The problem is not only on my machine: colleagues see the same thing after creating the envs from the exported YAML files of both environments.
I will keep investigating; do you have any ideas?
> Sorry, I removed the screenshot, I thought they could be useful.

The information might be useful, but in text form, and not to anyone who cares about this issue.

> do you have some idea

Yes, here are a few:
From @mrocklin on January 19, 2018 21:25
I sometimes want to put conda environments in Docker containers. Unfortunately, installing the classic scientific Python stack results in Docker images that are easily 1-3GB in size, which can be problematic when moving them around.
Example environment
Here is an example environment that I care about today:
This takes up 1.3GB of space.
I'm curious what all parts of the community can do to reduce this amount, including both package builders and library authors.
Example offenders, pandas and scipy
Some of the biggest offenders include scipy and pandas.
This is worse on conda-forge
And I suspect that this does not fully encompass the problem. Anecdotally I've noticed that avoiding and then installing scipy results in a 400MB increase, so presumably there are a number of dependency packages that were brought on.
pip
Pip packages seem to be smaller, but are still unpleasantly large. An environment similar to the one above consumes around 800MB on my machine.
What's going on?
Is all of this compiled code? If so, is it expected to produce hundreds of megabytes of binary artifacts? Even given the code complexity of these projects, it surprises me that they cannot be expressed more concisely than that. Are we including some dependencies multiple times? Are we accidentally shipping test data around? What is going on? How can we help reduce the size, in bytes, of the scipy stack?
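The "are we shipping test data" question can be answered concretely by measuring test directories inside an environment's site-packages (a sketch; the default SITEPKG path and Python version are assumptions, adjust for your env):

```shell
# Rank bundled test directories under a site-packages tree by size, to see
# how much of an environment is test code/data.
SITEPKG=${SITEPKG:-"$HOME/miniconda3/lib/python3.6/site-packages"}
find "$SITEPKG" -maxdepth 2 -type d \( -name tests -o -name test \) \
    -exec du -sh {} + 2>/dev/null | sort -h
```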
Copied from original issue: conda/conda#6756