conda-forge / numpy-feedstock

A conda-smithy repository for numpy.
BSD 3-Clause "New" or "Revised" License

Large, unnecessary, proprietary mkl package included in numpy and pandas install, inflates binary by 600MB #84

Closed answerquest closed 6 years ago

answerquest commented 6 years ago

Ref:

The mkl package is co-installed when we install either pandas or numpy using conda. It is a very large package, clocking in at ~200MB for download and ~600MB when installed in the pkgs folder of my MiniConda installation. The pip installer does not include this package when installing pandas. It is not in the conda feedstocks list and has no description on https://pypi.org/project/mkl/ . And..

License: Other/Proprietary License (Proprietary - Intel)
Author: Intel Corporation

I do not know more about this subject, but when I searched for mkl I came across more results for mkl-fft and mkl-random, which are not the same as mkl and are under free licenses. mkl-fft's description on PyPI also seems more numpy-involved: https://pypi.org/project/mkl-fft/

My hunch is that mkl-fft and mkl-random were the ones supposed to be included in the numpy installs and mkl got included by accident.

Where this is really causing a problem: when generating self-contained binaries for distribution, the mkl package gets roped in for programs that import either numpy or pandas if conda has installed it in the Python environment. For the Windows binary that PyInstaller creates, it balloons the dist by about 600MB.

Please investigate this, and if it's not essential to numpy, remove mkl from the conda numpy installation.

Info: Conda version: 4.5.1, on Windows 7 64-bit. As part of MiniConda Python3 64-bit.

Sharing lines from the numpy json file I found in my MiniConda installation's conda-meta folder:

 "arch": "x86_64",
  "build": "py36h5c71026_1",
  "build_number": 1,
  "channel": "https://repo.anaconda.com/pkgs/main/win-64",
  "constrains": [],
  "depends": [
    "icc_rt >=16.0.4",
    "mkl >=2018.0.2",
    "mkl_fft",
    "mkl_random",
    "python >=3.6,<3.7.0a0",
    "vc 14.*"
  ],

Sharing lines from [Miniconda3]\pkgs\mkl-2018.0.2-1\info\LICENSE.txt :

Intel Simplified Software License (Version January 2018)

For: Intel(R) Math Kernel Library (Intel(R) MKL)
     Intel(R) Integrated Performance Primitives (Intel(R) IPP)
     Intel(R) Machine Learning Scaling Library (Intel(R) MLSL)
     Intel(R) Data Analytics Acceleration Library (Intel(R) DAAL)
     Intel(R) Threading Building Blocks (Intel(R) TBB)
     Intel(R) Distribution for Python*
     Intel(R) MPI Library
jakirkham commented 6 years ago

First, we don't currently build numpy against MKL; only defaults does that currently, though they have a nomkl package that can be installed to opt out. We build against OpenBLAS, which is BSD 3-Clause. That said, it's possible that in the future we ship both options, OpenBLAS and MKL, letting users choose one, much like defaults. MKL is actually Open License (not Open Source), which means we can link to it and share it freely should we wish to.

rgommers commented 6 years ago

> MKL is actually Open License (not Open Source), which means we can link to it and share it freely should we wish to.

That's not quite complete. There is a potential issue here, especially when using PyInstaller or a similar tool: it may be a GPL violation to distribute an executable containing both MKL and a GPL component. The NumPy team has talked to Intel about this (Intel will not give definitive legal advice) and gotten good independent advice (a GPL violation is potentially possible here, but the likelihood is case-specific).

To add to the answer for @answerquest: MKL or another BLAS package is definitely necessary for numpy. You're getting MKL because you have installed the Anaconda defaults numpy. If you use conda install -c conda-forge numpy you will get the conda-forge package, which depends on OpenBLAS instead of MKL.

answerquest commented 6 years ago

Thanks for the clarification. Anyway, as the support links posted show, the programs work perfectly fine without mkl installed, so for the time being I'm not using conda to install the numpy and pandas packages, and that will be my recommendation in the support forums when questions about the oversized installs pop up again. [Edit] It's better to use conda install -c conda-forge numpy to install numpy: it replaces mkl with OpenBLAS.

What could help in this matter is a list of the numpy/pandas functions that actually do need mkl; then people would have an objective way of determining whether their programs need it or not. The difference is a whopping 600MB in program size, which is significant for any program creator (my program's binary is just 30MB when I go the no-conda way, and none of its functions are failing; it makes no sense for me to include mkl out of a sense of formality/loyalty), so it is well worth the disambiguation.
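Worth noting: MKL and OpenBLAS implement the same BLAS/LAPACK interface, so no numpy/pandas function requires MKL specifically; the difference is speed on linear-algebra-heavy workloads. A rough sketch for gauging whether a program leans on BLAS at all (sizes are arbitrary and timings will vary by machine):

```python
import time
import numpy as np

# Time a BLAS-backed operation (dense matrix multiply). If calls like
# this dominate your program's runtime, the backend (MKL vs OpenBLAS)
# matters for speed; if not, the much smaller OpenBLAS loses you nothing.
rng = np.random.default_rng(0)
a = rng.random((500, 500))
b = rng.random((500, 500))

t0 = time.perf_counter()
c = a @ b
print(f"500x500 matmul: {time.perf_counter() - t0:.4f} s")
```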

Also, if a conda install had a way to manually specify which dependency to exclude, that would be a good workaround, as conda's other benefits over pip remain and I still want to use conda.

rgommers commented 6 years ago

@answerquest that's not the best recommendation unfortunately. It works in that case, but installing numpy with pip inside a conda env is not a good idea. numpy is special-cased by conda, so it's about the only thing that you really shouldn't install with pip. Two better alternatives:

  1. conda install -c conda-forge numpy (will give you the same OpenBLAS dependency as the official numpy wheel has that pip grabs)
  2. Don't use conda, but create a clean virtualenv and install with pip into that.
answerquest commented 6 years ago

@rgommers my bad, sorry, I had not read the OpenBLAS line correctly. If -c conda-forge helps to exclude mkl then that's a good solution indeed. I'm guessing OpenBLAS is not 600MB in size?

Definitely using virtual environment to create the binary.

rgommers commented 6 years ago

Indeed, should be <10 MB.
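For anyone wanting to verify this locally, the on-disk footprint of an installed package can be summed directly; a sketch, assuming a standard site-packages layout:

```python
# Sum the on-disk size of the installed numpy package tree.
import pathlib
import numpy as np

pkg_dir = pathlib.Path(np.__file__).parent
total = sum(p.stat().st_size for p in pkg_dir.rglob("*") if p.is_file())
print(f"numpy package tree: {total / 1e6:.1f} MB")
```

Note this measures numpy itself; an MKL-backed install additionally drops the large mkl runtime into the environment's library directory, which is where the ~600MB lives.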

whekman commented 5 years ago

For anyone trying to do as @rgommers suggests (option 1 - it worked in the end!), the following might save you an hour of puzzling: stackoverflow thread.

I was having difficulty installing pyinstaller AND numpy with OpenBLAS just now, because my "conda install -c conda-forge pyinstaller" command resulted in numpy being "upgraded" to an MKL-linked one. The link explained a great deal, and PyInstaller now turns my "import numpy" .py into an exe (on Windows) of <14MB :)

Still, scary to be so dependent on what version is available/downloaded via conda. Would be a shame not to be able to make small executables which make use of numpy. Should I be worried?
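If you're unsure whether the numpy a PyInstaller build will pick up is MKL-linked, one rough check (a sketch; library naming and layout differ across platforms and conda versions) is to scan the active environment for MKL shared libraries:

```python
# Scan the active environment for shared libraries with "mkl" in the
# name; an MKL-linked numpy install ships several alongside it.
import pathlib
import sys

env_root = pathlib.Path(sys.prefix)
suffixes = {".dll", ".so", ".dylib"}
mkl_libs = sorted(
    p.name for p in env_root.rglob("*")
    if p.suffix in suffixes and "mkl" in p.name.lower()
)
print("MKL libraries found:", mkl_libs or "none")
```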

jakirkham commented 5 years ago

FWIW what I typically do is conda install conda-forge::blas=*=openblas. This ensures you will get OpenBLAS backed NumPy and friends.

msarahan commented 5 years ago

Yes, or add

blas=*=openblas

to your .condarc: https://conda.io/docs/user-guide/configuration/use-condarc.html#always-add-packages-by-default-create-default-packages

FSund commented 5 years ago

@msarahan How do I add this? I tried adding it at the bottom of the .condarc in my environment, but then I get the following error

LoadError: Load Error: in C:\Users\filip\Anaconda3\envs\sci\.condarc on line 3, column 15. Invalid YAML
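That error usually means the spec was pasted in as a bare line. .condarc is YAML, so the entry has to live under a key; a sketch of what it might look like, assuming the create_default_packages mechanism linked above (quoting the spec keeps YAML from tripping over the *):

```yaml
# .condarc -- packages added to every newly created environment
create_default_packages:
  - "blas=*=openblas"
```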
answerquest commented 5 years ago

Hi, just FYI (not replying to any earlier post here): I've since had no problems using just pip to install the numpy and pandas modules for my application. If they're leaving anything out, my program isn't using it anyway, and I haven't experienced any problems from it. The PyInstaller-generated .exe (single-file) is only around 30MB, and that's without UPX compression.
(Note: don't use UPX compression when making a single-file exe with PyInstaller, as UPX corrupts one of the DLLs.)

Earlier, pip was having a problem with pandas, which was why I was using conda, but that got resolved just days after I posted here. This update isn't relevant for this repo, but seeing that there's activity here and I was the OP, I felt obliged to disclose how I finally solved the problem on my end: I went with pip and it worked out fine.
No hard feelings for the conda folks; hope you don't mind this update.

FSund commented 5 years ago

> Hi, just FYI (not replying to any earlier post here), I've since had no problems in using just pip to install numpy and pandas modules for my application. If they're leaving anything out, then my prog isn't using it anyways and I haven't experienced any problems off it. The pyinstaller-generated .exe (single-file) is only around 30mb that too without upx compression. (Note: don't use upx compression if making single-file exe using pyinstaller, as upx screws up one of the dll's)

This issue is fixed in newer versions, so installing numpy from conda-forge should get you an OpenBLAS version.

But there is no OpenBLAS/nomkl version of scipy on Windows yet, so I'm using pip to install scipy. I have the same experience as you: no issues, but something is probably not getting installed correctly. I'd prefer not to mix pip and conda, though, so I'd love a conda-forge version of scipy. Work on that is going on here: https://github.com/conda-forge/scipy-feedstock/pull/78

Gnomic20 commented 4 years ago

This is slow - 1/3 of my pandas load time is in one slow call.

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    2.295    2.295    2.295    2.295  {built-in method mkl._py_mkl_service.get_version}
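Output in that shape comes from cProfile; a minimal sketch for reproducing the measurement around the pandas import (numbers will differ per machine):

```python
import cProfile
import pstats

# Profile the pandas import and list the most expensive calls by
# cumulative time; in an MKL-linked build, the mkl version query
# shows up near the top.
profiler = cProfile.Profile()
profiler.enable()
import pandas  # noqa: F401 -- the import being measured
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```

Note this must run in a fresh interpreter: if pandas is already imported, the measurement is meaningless.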

relevant versions:

INSTALLED VERSIONS
commit : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel

pandas : 1.1.1
numpy : 1.19.1

mkl isn't even listed in the show_versions output.
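One way to confirm whether the mkl-service shim (the mkl module in that profile line) is present at all, without importing it; a sketch using only the standard library:

```python
# Check whether the "mkl" module (the mkl-service shim that pandas'
# import path touches in MKL builds) is importable in this environment.
import importlib.util

present = importlib.util.find_spec("mkl") is not None
print("mkl-service importable:", present)
```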