docker-library / python

Docker Official Image packaging for Python
https://www.python.org/
MIT License

Python should be configured/built with --enable-shared option. #21

Closed GrahamDumpleton closed 9 years ago

GrahamDumpleton commented 10 years ago

Currently the Python installations are built with a command like:

RUN mkdir -p /usr/src/python \
    && curl -SL "https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tar.xz" \
        | tar -xJC /usr/src/python --strip-components=1 \
    && cd /usr/src/python \
    && ./configure \
    && make -j$(nproc) \
    && make install \
    && cd / \
    && rm -rf /usr/src/python

When running configure, you should supply the --enable-shared option to ensure that a shared library is built for Python. Without it, any application that wants to use Python as an embedded environment cannot be built, because the lack of the shared library causes any embedding system to fail at compile time with:

/usr/bin/ld: /usr/local/lib/libpython2.7.a(abstract.o): relocation R_X86_64_32S against `_Py_NotImplementedStruct' can not be used when making a shared object; recompile with -fPIC

/usr/local/lib/libpython2.7.a: error adding symbols: Bad value

collect2: error: ld returned 1 exit status

error: command 'gcc' failed with exit status 1

This basic mistake is something that Linux distributions themselves made for many years, and it took a lot of complaining and education to get them to fix their Python installations. It would be nice to see you address this and do what all decent Linux distributions now do, and have done for a while: install Python with shared libraries.
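For reference, the change being requested is a one-flag addition to the configure step shown above. A minimal sketch of the amended Dockerfile snippet (the trailing ldconfig is an assumption on my part: after installing a new shared library into /usr/local/lib, the dynamic loader cache generally needs refreshing before the binary will run):

```shell
RUN mkdir -p /usr/src/python \
    && curl -SL "https://www.python.org/ftp/python/$PYTHON_VERSION/Python-$PYTHON_VERSION.tar.xz" \
        | tar -xJC /usr/src/python --strip-components=1 \
    && cd /usr/src/python \
    # Build libpythonX.Y.so alongside the interpreter so embedding works.
    && ./configure --enable-shared \
    && make -j$(nproc) \
    && make install \
    # Refresh the loader cache so the new libpython is found at run time.
    && ldconfig \
    && cd / \
    && rm -rf /usr/src/python
```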

tianon commented 10 years ago

Do you happen to have a link to some upstream documentation about recommended configure flags?

The best I can find are the following (which is why our defaults are what they are):

None of which mention any flags that we ought to set in the general case, so I'm hoping there's a better document somewhere we can point to for an explanation of why we set the flags we do.

GrahamDumpleton commented 10 years ago

As you have found, the Python documentation isn't particularly helpful in this respect.

The best thing to do is look at what the base Linux distro you are using is itself using when building Python.

$ docker run -a stdin -a stdout -i -t buildpack-deps /bin/bash
root@b163842c23f2:/# apt-get update
Get:1 http://security.debian.org jessie/updates InRelease [84.1 kB]
Get:2 http://http.debian.net jessie InRelease [191 kB]
Get:3 http://security.debian.org jessie/updates/main amd64 Packages [20 B]
Get:4 http://http.debian.net jessie-updates InRelease [117 kB]
Get:5 http://http.debian.net jessie/main amd64 Packages [9104 kB]
Get:6 http://http.debian.net jessie-updates/main amd64 Packages [20 B]
Fetched 9497 kB in 17s (553 kB/s)
Reading package lists... Done
root@b163842c23f2:/# apt-get install python2.7-dev
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  libpython2.7 libpython2.7-dev
The following NEW packages will be installed:
  libpython2.7 libpython2.7-dev python2.7-dev
0 upgraded, 3 newly installed, 0 to remove and 96 not upgraded.
Need to get 33.5 MB of archives.
After this operation, 48.9 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 http://http.debian.net/debian/ jessie/main libpython2.7 amd64 2.7.8-11 [1080 kB]
Get:2 http://http.debian.net/debian/ jessie/main libpython2.7-dev amd64 2.7.8-11 [32.1 MB]
Get:3 http://http.debian.net/debian/ jessie/main python2.7-dev amd64 2.7.8-11 [265 kB]
Fetched 33.5 MB in 50s (666 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libpython2.7:amd64.
(Reading database ... 29248 files and directories currently installed.)
Preparing to unpack .../libpython2.7_2.7.8-11_amd64.deb ...
Unpacking libpython2.7:amd64 (2.7.8-11) ...
Selecting previously unselected package libpython2.7-dev:amd64.
Preparing to unpack .../libpython2.7-dev_2.7.8-11_amd64.deb ...
Unpacking libpython2.7-dev:amd64 (2.7.8-11) ...
Selecting previously unselected package python2.7-dev.
Preparing to unpack .../python2.7-dev_2.7.8-11_amd64.deb ...
Unpacking python2.7-dev (2.7.8-11) ...
root@b163842c23f2:/# grep CONFIG_ARGS /usr/lib/python2.7/config-x86_64-linux-gnu/Makefile
CONFIG_ARGS=     '--enable-shared' '--prefix=/usr' '--enable-ipv6' '--enable-unicode=ucs4' '--with-dbmliborder=bdb:gdbm' '--with-system-expat' '--with-system-ffi' '--with-fpectl' 'CC=x86_64-linux-gnu-gcc' 'CFLAGS=-D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security ' 'LDFLAGS=-Wl,-z,relro'
    $(SHELL) $(srcdir)/configure $(CONFIG_ARGS)

So you have a number of options telling Python to use system variants of some C libraries, such as expat, rather than the bundled copies. This comes from Debian's policy of never allowing packages to bundle something it already ships as a separate Debian package. Whether you want to go to that extent I am not sure.

Next is the presence of --enable-unicode=ucs4. By default, Python built from source uses ucs2 for Unicode. I don't really understand the rationale for using ucs4, but every Linux distribution I have seen overrides the default and uses ucs4.

You then have --enable-shared which is the one I am mainly concerned about.

And besides that, you have a mix of other options whose implications I don't personally understand, such as --enable-ipv6, --with-dbmliborder=bdb:gdbm and --with-fpectl.

The configure help for these options is:

  --enable-ipv6           Enable ipv6 (with ipv4) support
  --with-dbmliborder=db1:db2:...
                          order to check db backends for dbm. Valid value is a
                          colon separated string with the backend names
                          `ndbm', `gdbm' and `bdb'.
  --with-fpectl           enable SIGFPE catching

  --enable-shared         disable/enable building shared python library
  --enable-unicode[=ucs[24]]
                          Enable Unicode strings (default is ucs2)
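As a quick way to tell whether any installed interpreter was built with --enable-shared, you can ask it directly via sysconfig (a sketch; the same call works on python2.7 if you substitute the interpreter name, and Py_ENABLE_SHARED is 1 for shared builds, 0 for static ones):

```shell
# Prints 1 if this interpreter was configured with --enable-shared, else 0.
python3 -c "import sysconfig; print(sysconfig.get_config_var('Py_ENABLE_SHARED'))"
```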
yosifkit commented 9 years ago

@GrahamDumpleton: @tianon is very familiar with Debian packaging. The issue is that we want to do what Python itself recommends, and not blindly follow what a packager in Debian or [insert Linux distro here] does. Since there is little in their own documentation, we could also point to a common use case that requires Python to be built with the shared configure option.

@tianon, perhaps this would be enough (mod_wsgi in apache2)? And maybe this will help (to use python within PostgreSQL)?

GrahamDumpleton commented 9 years ago

I can lodge a bug report against the Python documentation, with suggested edits, and we can perhaps get it updated to provide examples of saner defaults. I will have to pester @ncoghlan about the best path to take.

tianon commented 9 years ago

That would be amazing! I'd really love to see this particular bit better documented upstream, especially since it's recommended. :heart:

tianon commented 9 years ago

Just adding --enable-shared to each Python build, I'm getting the following (just when trying to run the resulting binary):

2.7: python2: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
3.3: python3: error while loading shared libraries: libpython3.3m.so.1.0: cannot open shared object file: No such file or directory
3.4: python3: error while loading shared libraries: libpython3.4m.so.1.0: cannot open shared object file: No such file or directory

The relevant *.so files are in /usr/local/lib appropriately, so there must just be something with ldconfig that we're missing.

tianon commented 9 years ago

Haha, just running ldconfig is sufficient. Carrying on!

GrahamDumpleton commented 9 years ago

Yep. I added:

RUN ldconfig -v

to my copy. I was wondering if you would get confused or would know to add it. :-)

GrahamDumpleton commented 9 years ago

Another thing for you to think about.

The bundled tests installed with the Python modules take up quite a lot of room. Do you really need to have them installed?

For even more savings on image size, are the benefits of having pyc/pyo files really worth it?

Together, the tests and compiled Python code files take up 30 MB or so.

So in my case, when needing to build custom Python installations where size is an issue, and where processes are principally long running, I prune both:

[prune-python]
recipe = plone.recipe.command
command = find ${python:libdir} -depth \
           \( \( -type d \( -name test -o -name tests \) \) -or \
              \( -type f \( -name '*.pyc' -o -name '*.pyo' \) \) \) \
           -exec rm -rf {} \;

I can then get the installed Python size down to 70 MB.
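Translated out of the buildout recipe, the same prune can be done with plain find commands (a sketch; the explicit grouping matters because find's -and binds tighter than -or, and -depth ensures directory contents are visited before the directory itself is removed):

```shell
# prune_python: remove bundled test/tests directories and byte-compiled
# files from a Python lib directory (e.g. /usr/local/lib/python2.7).
prune_python() {
    libdir=$1
    # Directories first; -depth visits contents before each directory.
    find "$libdir" -depth -type d \( -name test -o -name tests \) -exec rm -rf '{}' +
    # Then any remaining byte-compiled files outside those directories.
    find "$libdir" -type f \( -name '*.pyc' -o -name '*.pyo' \) -delete
}
```

For example, `prune_python /usr/local/lib/python2.7` would strip the stdlib's bundled test suites and all .pyc/.pyo files in one pass.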

tianon commented 9 years ago

Ah nice, good point. Definitely worth adding that too, IMO; thanks! :D

Starefossen commented 9 years ago

Next is the presence of --enable-unicode=ucs4. By default, Python built from source uses ucs2 for Unicode. I don't really understand the rationale for using ucs4, but every Linux distribution I have seen overrides the default and uses ucs4.

I ran into an issue because this version of Python isn't compiled with --enable-unicode=ucs4. Is there any possibility to append/override configure options using environment variables, like pyenv does?
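For what it's worth, you can check which Unicode build an interpreter is from the interpreter itself (a sketch: on a narrow/ucs2 build sys.maxunicode is 65535, on a wide/ucs4 build it is 1114111, and Python 3.3+ dropped the distinction and always reports the wide value):

```shell
# 65535 => narrow (ucs2) build; 1114111 => wide (ucs4) build.
# Python 3.3+ always prints 1114111 (flexible string storage, PEP 393).
python3 -c "import sys; print(sys.maxunicode)"
```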

gjcarneiro commented 9 years ago

Unfortunately, --enable-shared builds a CPython interpreter that is slower. What distributions usually do is build Python twice: once static, to make the /usr/bin/python interpreter faster, and once with --enable-shared, just so that the shared library gets installed.

GrahamDumpleton commented 9 years ago

By all means have a static 'python' executable, but still provide a libpythonX.Y.so shared library. You would have to be careful about the order in which they are installed, or control exactly what is installed, as installing the shared build second may overwrite the static 'python' executable. I am also not sure whether installing the static build second would cause problems with the configuration snapshot generated in 'config/Makefile' and similar files that some of distutils and other tooling depends on. So you would need to verify that it all plays well together.
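A rough sketch of the two-pass approach being described, assuming you are in an unpacked Python source tree (untested, per the caveats above; the shared build goes first so the statically linked executable installed second is the one that survives):

```shell
# Pass 1: shared build; installs libpythonX.Y.so plus a dynamically
# linked python binary that the second pass will overwrite.
./configure --enable-shared
make -j"$(nproc)"
make install

# Pass 2: static build; reconfigure from a clean tree and reinstall so
# the final python binary is statically linked against libpython.
make distclean
./configure
make -j"$(nproc)"
make install

# Refresh the loader cache for the shared library installed in pass 1.
ldconfig
```

Whether pass 2's install leaves the config/Makefile snapshot consistent with the shared library is exactly the open question above, so this would need verifying before shipping.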

I would suggest, though, that for the bulk of workloads any difference is negligible to the point of being insignificant. All of the Python C extension modules are still going to be loaded dynamically. It would have to be an application very heavily biased towards CPU-bound tasks running pure Python code (as the benchmarks you quote generally are), with little calling out to anything else that would shift work into separate extension modules or turn it into a more I/O-bound task.

When we talk about such heavy-duty data processing, one usually talks about using numpy and similar modules, and those rely on C extension modules. So your mileage will vary, and one is far more likely to see significant performance gains through good choice of algorithms and Python language constructs. Static linking is not some magic solution; in general, people would do better to look at the design of their code instead.

gjcarneiro commented 9 years ago

I think a 9.3% average difference is very significant. The difference in speed could be as high as 30% in some cases, from the benchmarks I linked.

GrahamDumpleton commented 9 years ago

Those benchmarks are principally CPU bound tasks and not representative of most real world applications. Once you start factoring in issues like the Python global interpreter lock, use of Python C extensions, I/O etc, any difference should realistically rapidly fall away. As I said, working on improving things at the Python code level would generally result in much bigger gains.

So I am not saying having a static 'python' executable is a bad idea, but I would take any suggestion that it in general makes a big difference with a grain of salt.

gjcarneiro commented 9 years ago

Well, it makes enough difference so that Debian/Ubuntu decided to keep the Python interpreter statically built. Funny enough, Fedora instead builds a shared lib based Python. I'm surprised.

Still, I would prefer a Python optimised for running programs, not for embedding. 90%[1] of Python usage out there is for standalone scripts, not for embedding.

[1] 90% is a completely made up number

ncoghlan commented 9 years ago

The 90% is both completely made up, and almost certainly wildly inaccurate.

The thing to remember is that mod_wsgi embeds a Python interpreter into Apache processes so you can benefit from the rich ecosystem of Apache modules (especially for authentication and authorization) in Python-based web applications, rather than having to reinvent all those wheels at the Python layer.

If folks are genuinely worried about the CPU bound speed of a network service written in Python, the answer isn't to make small tweaks to the CPython build settings, it's to get their service running under PyPy instead of continuing to use CPython: http://speed.pypy.org/

gjcarneiro commented 9 years ago

The way I see it, this Docker image is the one making tweaks to the CPython build settings. By default, CPython builds a static interpreter, not a shared lib based one. You must be thinking of the Red Hat case, where Python is shlib based by default. But in upstream CPython, build is static by default. There must be a reason...

I would argue that your typical website does not use Apache because it does not need Apache authentication and authorization. Instead, they have their own auth layer built at application level (login forms). The way I do web deployment is via Gunicorn, which is a standalone process. Apache mod_wsgi is an outdated way to deploy web apps. See this. In fact, I still use Apache for static content and I'm sick of it, all the security just gets in the way.

And no, PyPy is not the answer to everything. I/O bound tasks can become slower under PyPy.

It would be nice to have this Docker image split in two. The base image would provide a statically compiled Python interpreter and the standard library. Another image, built on top of the first, would provide just the shared library.

Anyway, I am just making my point. Ever since I found out that Red Hat-based distros already have a shared-lib based interpreter, in contrast to Debian-based ones, I think this is probably a less relevant problem than I originally imagined. I would still prefer the small speed-up of having a static python, but I also understand the other side of the argument: having only shared simplifies things at only a small performance cost.

ncoghlan commented 9 years ago

I'm afraid I'm still not following your argument. If your task is IO bound, then neither PyPy nor a statically linked CPython will make your application any faster. If it's CPU bound, then a JIT compiled Python like PyPy or Numba, or an ahead of time compiler like Cython, is going to make far more of a difference than whether CPython was built as a shared library or not.

I also wouldn't place too much weight on our default settings upstream - CPython was originally only available as a statically linked executable, with shared library support added later (starting with https://hg.python.org/cpython-fullhistory/rev/3a70e9c0d9f5). In those kinds of situations, "it was implemented first" is the main determinant of the default behaviour, rather than any specific technical difference between the available options.

The part we can agree on is that the small speed up from static linking isn't worth the extra effort of maintaining a separate image that supports dynamic embedding and having to explain to people that some applications (like mod_wsgi) won't work on the default image.

alexlusher commented 8 years ago

Hi Graham,

I am trying to use your guidelines to compile Python 2.7.12 on CentOS 7 with "--enable-shared", but it leads to a very weird outcome described by some others: the compiled binary reports version 2.7.5, like the system-wide Python used by yum. It actually compiles correctly without the "shared" option. Can you kindly point out what needs to be done to fix it?

STEP 1: Preparations

yum groupinstall -y development
yum install -y centos-release-SCL
yum install -y zlib-dev openssl-devel sqlite-devel bzip2-devel glibc-devel expat-devel gdbm-devel readline-devel tcl tcl-devel tk tk-devel ncurses-devel db4-devel libpcap-devel xz-devel xz-libs

STEP 2: Configuring before the Make

./configure --enable-shared --enable-unicode=ucs4 --prefix=/opt/python/python2.7.12

STEP 3: Make and Alt-Install

make
sudo make altinstall

Gratefully, Alex

GrahamDumpleton commented 8 years ago

@alexlusher What you need to do has been extensively documented in:

When you say 'guidelines' are you referring to comments above, or that post?

That you are using make altinstall suggests you haven't read that post.
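For the record, the usual cause of the symptom described is that the freshly built binary resolves libpython2.7.so from the system library path at run time, picking up the distro's 2.7.5 library instead of the newly built one. A common remedy, sketched here and not verified on CentOS 7 specifically, is to embed an rpath pointing at the new installation's lib directory so the binary finds its own libpython first:

```shell
# Embed the private prefix's lib dir in the binary's runtime search path
# so it loads its own libpython2.7.so, not the system copy.
./configure --enable-shared --enable-unicode=ucs4 \
    --prefix=/opt/python/python2.7.12 \
    LDFLAGS='-Wl,-rpath=/opt/python/python2.7.12/lib'
make
sudo make install
```

Whether to use make install or make altinstall here depends on the caveats in the post referred to above.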

LuisAlejandro commented 7 years ago

I'm currently in the process of understanding and replicating Debian's method for building python, which you can find in its debian/rules here:

https://sources.debian.net/src/python2.7/2.7.12-7/debian/rules/

When I'm done with it, I'll post back.

LuisAlejandro commented 7 years ago

Hi @tianon.

Perhaps you should consider downloading the Python source from Debian and compiling from there, instead of using the vanilla method.

I've put up a working script, you are free to take any ideas you like from it with proper attribution.

https://github.com/LuisAlejandro/dockershelf/blob/master/python/build-image.sh#L113

Greetings!