wheels with bundled libclang #2

Open paleozogt opened 6 years ago

Currently the clang whl depends on libclang having been installed on the host system. This can be problematic if the wheel's python api mismatches the system libclang, or if its hard to get libclang installed on the system for some reason.

It would be pretty cool if the clang whl contained libclang, allowing it to be standalone and usable without any other setup.

To that end, I've started on a fork of this project that implements this feature: https://github.com/paleozogt/clang/tree/withlibs

It's integrated with GitLab CI. The latest build is here, with platform-specific whl artifacts here. It builds platform-specifc wheels for:

Mac OS X 10.11+ (libclang taken from Homebrew)
Windows 32-bit and 64-bit (libclang taken from LLVM installer exes)
manylinux1 32-bit and 64-bit (libclang taken from manylinux1 docker images)

It also runs the unit-tests for linux. While gitlab.com doesn't have Windows or Mac shared runners (yet), I have manually tested those wheels.

Wow, thank you so much for working on this! I will take a look at the configuration and the process. I'm not that familiar with gitlab ci, so it might take me a little while. A couple point off the bat:

the wheels should be versioned from this Python project's version (so 6.0.0.2), not clang itself. That allows for point releases with release/process bug fixes. As for the versioning at the moment, there isn't much of a reason not to bump to 6.0.1, since the python bindings are exactly the same.
for the manylinux wheels, it seems you are using paleozogt/manylinux1_x86_64_clang:$LLVM_VERSION, is this a custom docker container? I'd prefer to use the official ones if possible.

I don't mind switching to another CI service (Travis, etc). I went with GitLab CI because it allows for saving artifacts, which makes things simpler in some ways.
I can change the versioning, no problem. So for a 6.0.1-based wheel, it would be 6.0.1.0?
I hear what you're saying about official builds. While I was able to get official builds for mac/homebrew and windows, there doesn't seem to be anything for linux that would work on lots of distros (aka "manylinux"). I was hoping to just have CI build libclang on a manylinux docker image and include that in the whl, but building llvm/clang just takes too darn long (way longer than the free tiers of GitLab CI/Travis can tolerate). So I ended up making my own manylinux1-based Docker images that build llvm/clang, which are used by this project's CI. The source for the Docker images are here.

I don't mind switching to another CI service (Travis, etc). I went with GitLab CI because it allows for saving artifacts, which makes things simpler in some ways.

I don't have strong opinions here, though I agree being able to save artifacts is a nice thing. One project that I like which I have used before is cibuildwheel. As for saving artifacts, I have a server I can host them on, if we decide that something other than gitlab is a good choice.

I can change the versioning, no problem. So for a 6.0.1-based wheel, it would be 6.0.1.0?

It would depend on the patch version of this package. Ideally it would stay at 6.0.1.0 for example, but on occasion I might make a mistake in packaging or some such thing, therefore I would need to bump the patch version (to 6.0.1.1). Therefore, it really should be read from the setup.py or perhaps a __version__.py.

I was hoping to just have CI build libclang on a manylinux docker image and include that in the whl, but building llvm/clang just takes too darn long (way longer than the free tiers of GitLab CI/Travis can tolerate). So I ended up making my own manylinux1-based Docker images that build llvm/clang, which are used by this project's CI.

Oh, I know all too well how long clang can take to build :)

I think clang is released infrequently enough that it makes sense to just build it in the manylinux container locally once, then zip the build folder, host it, and download that when needed. Then we can both use the official docker container, and only need to build clang every once in a blue moon.

I don't have strong opinions here, though I agree being able to save artifacts is a nice thing. One project that I like which I have used before is cibuildwheel. As for saving artifacts, I have a server I can host them on, if we decide that something other than gitlab is a good choice.

cibuildwheel seems really useful for building lots of CPython variations (py27-cp27m, py27-cp27mu, etc), but since the clang wheels just use ctypes, its not bound to a particular CPython version and just uses py2.py2-none.

It does seem to have some nice uploading features, tho.

If we use your server for artifacts, Travis+Appveyor might be worth switching to, as then we could test across linux, mac, and windows. I hate producing platform-specific wheels that aren't tested. :)

It would depend on the patch version of this package. Ideally it would stay at 6.0.1.0 for example, but on occasion I might make a mistake in packaging or some such thing, therefore I would need to bump the patch version (to 6.0.1.1). Therefore, it really should be read from the setup.py or perhaps a version.py.

I've modified the build to tack on a PACKAGE_VERSION to the LLVM_VERSION that gets sent into setup.py.

I think clang is released infrequently enough that it makes sense to just build it in the manylinux container locally once, then zip the build folder, host it, and download that when needed. Then we can both use the official docker container, and only need to build clang every once in a blue moon.

I was essentially using that Docker container as a roundabout hosting of the libclang build, so this seems more straightforward. I've added a manylinux folder with build scripts that produce zips of the binary llvm distro (see here).

If you can run these scripts and put the zips on your hosting, I can point the wheel build at them.

cibuildwheel seems really useful for building lots of CPython variations (py27-cp27m, py27-cp27mu, etc), but since the clang wheels just use ctypes, its not bound to a particular CPython version and just uses py2.py2-none.

Yes, that is a good point.

If we use your server for artifacts, Travis+Appveyor might be worth switching to, as then we could test across linux, mac, and windows. I hate producing platform-specific wheels that aren't tested. :)

I think that is ideal.

I've modified the build to tack on a PACKAGE_VERSION to the LLVM_VERSION that gets sent into setup.py.

Excellent, thank you!

I was essentially using that Docker container as a roundabout hosting of the libclang build, so this seems more straightforward. I've added a manylinux folder with build scripts that produce zips of the binary llvm distro (see here). If you can run these scripts and put the zips on your hosting, I can point the wheel build at them.

I've kicked off those scripts on my server, I will comment when they are done.

I'm still looking at the best way to upload/host artifacts to my server, but I will keep you posted.

Also I'd still prefer to keep at least the setup.py in the repository root, so that the package can be installed from git (eg pip install git+https://github.com/ethanhs/clang.git).

Thank you again for working on this!

Also I'd still prefer to keep at least the setup.py in the repository root, so that the package can be installed from git (eg pip install git+https://github.com/ethanhs/clang.git).

I've refactored my branch to have setup.py at the top-level. You can now do:

pip install git+https://github.com/paleozogt/clang/@withlibs

and it will build the whl.

To make this work I moved the clang source download out of the .gitlab-ci.yml and into setup.py. At the moment this will make a whl like what you have on master-- no bundled libs.

I'd like to also move the win/mac/linux libclang download/extraction out of .gitlab-ci.yml and into setup.py, so that building it manually will also result in a whl with bundled libs.

However, I can't figure out how to read the --plat-name argument that setuptools is using from setup.py. Any ideas?

Thank you for making it install-able from git!

I'd like to also move the win/mac/linux libclang download/extraction out of .gitlab-ci.yml and into setup.py, so that building it manually will also result in a whl with bundled libs.

I think I'd rather leave it up to the person who is installing from git to have the requisite version of clang installed. I would be very scared if a package I installed started downloading 1.2GB of binaries for me, without prompting. Also keep in mind, many people have bad internet connections, data caps, and other internet issues.

Anyway, I've built the zip files, they are available at https://ethanhs.me/static/llvm-6.0.1-Linux-x86_64.zip and https://ethanhs.me/static/llvm-6.0.1-Linux-i686.zip.

Another advantage of not having the binary download happen on package install is that we can use 7zip to compress the binaries in CI. I've gone ahead and created Ultra compressed archives, which are each half the size of the zip archives.

Good point about ppl not necessarily wanting the binaries. So the build directly from git will produce the whl by downloading the clang source, but not the binaries. I'm thinking one of the CI-produced whls should also be a platform-less version.

I'm still going to move the binary downloading out of the CI script and into python (which keeps us CI-agnostic), but the download will only happen when invoked from CI.

btw, while I can download the manylinux1 binaries from your website using a browser or wget, doing it from python doesn't work. This

python -c "from urllib import urlretrieve; urlretrieve('https://ethanhs.me/static/llvm-6.0.1-Linux-x86_64.zip', 'llvm-6.0.1-Linux-x86_64.zip')"

gives an error

The owner of this website (ethanhs.me) has banned your access based on your browser's signature (4494415a69fc5023-ua48).

I've pushed a new update that has all the download/extract logic in python now, along with some more CI stuff: https://gitlab.com/paleozogt/clang/pipelines/27826391

It includes a hack to download via wget for the manylinux zips that I'll remove once the server allows for downloading from urllib.

Anyway, it now builds a platform-agnostic py2.py3-none-any whl alongside the other platform-specific whls. Also, it tests pip installing from source for both py2 and py3, so we can be sure that installing directly from github works.

Next I'm going to play around with Travis/Appveyor testing.

btw,

Another advantage of not having the binary download happen on package install is that we can use 7zip to compress the binaries in CI. I've gone ahead and created Ultra compressed archives, which are each half the size of the zip archives.

I'm not sure what this means... The binaries (.so, .dll, .dylib) go into the whl, which (I think?) has to be a zip file.

So the build directly from git will produce the whl by downloading the clang source, but not the binaries.

No, this really doesn't help. I don't think running pip install ./python setup.py install or any combination should download anything by default, including clang sources. A cleaner solution to this (IMO) is to use setuptools' extra_requires so that someone could pip install clang[bundle] (or perhaps clang[libs] and it would download and install the libclang binaries, but pip install clang would be as it is currently, just the bindings.

I'm thinking one of the CI-produced whls should also be a platform-less version. Anyway, it now builds a platform-agnostic py2.py3-none-any whl alongside the other platform-specific whls.

This is an excellent idea! Thank you.

doing it from python doesn't work

I'm not sure exactly what is causing this, I went though all of my configurations and nothing seemed amiss. I'd recommend using a fake user-agent string for now, though I will keep looking.

...we can use 7zip... I'm not sure what this means...

Ah, do you intend to download the full zip of clang binaries, or individual binaries? If the former, downloading a 7z of the clang binaries would be faster, instead of a zip of the clang binaries.

I'm probably starting to sound like a broken record, but I do really appreciate all the work you are putting into this, it is greatly appreciated :)

No, this really doesn't help. I don't think running pip install ./python setup.py install or any combination should download anything by default, including clang sources.

Ah, I see how we got our wires crossed. I removed all the clang sources from my fork and changed it to download everything (clang python bindings and binaries). The reason I did this was so that maintenance would be easier-- releasing is literally just bumping the version number in version.txt. If you don't want pip install . to download, that means we have to manually extract the clang bindings and keep them in sync in git with the lib version, which seems like extra work and opens the possibility of them getting out of sync.

It's not really my call, tho-- if you feel like the extra release overhead is worth it, I can change it to be a hybrid build. That is, clang python bindings would be in git, but the binaries would get downloaded.

Ah, do you intend to download the full zip of clang binaries, or individual binaries? If the former, downloading a 7z of the clang binaries would be faster, instead of a zip of the clang binaries.

At the moment the platform-specific builds download the full zip of clang binaries and extracts what's needed. Its definitely downloading more than is strictly necessary, but it doesn't take very long. It seems simpler than having us manually extract the binaries and put them on a server. And pulling the binaries from their official URLs feels safer-- less opportunities for security shenanigans.

I'm probably starting to sound like a broken record, but I do really appreciate all the work you are putting into this, it is greatly appreciated :)

No problem-- it's a fun little project to figure out. And when its done it'll be useful for me, so its not entirely altruistic. :) Thanks for keeping me on the straight and narrow!

btw, I've done some more work on it. CI now uses the best (?) of three worlds:

GitLab for artifacts and linux testing
Travis CI for mac testing
AppVeyor for win testing

The way it works is that GitHub commits kick off the GitLab jobs (see here), which build the whls for all platforms and does multi-platform/multi-python linux testing. Then it POSTs to Travis and AppVeyor, which download the GitLab whl artifacts and does Mac and Windows testing (see here and here). The Mac testing is across multiple OS X versions along with python 2/3, while the Windows tests are across 32/64-bit and python 2/3.

Sadly, I don't know how to get Travis/AppVeyor to communicate their test results back to GitLab. So when releasing you'd have to visually check that all three passed. But that doesn't seem too bad, considering how well-tested it'll be.

I realize this is a little Rube-Goldberg. The fact that there's no open-source friendly CI that does win/mac/linux is a minor travesty. 😢

I've used this for windows+mac+linux with some success: https://cirrus-ci.org/ Can't speak to the quality though, it was just a dead simple toy project.

Seconding pip install clang[libs] scheme, that seems ideal to me. Thanks for working on this! Very cool.

@noahp Thanks for the link! I'll give CirrusCI a look. Getting this all running on a single CI service instead my current rube-goldberg machine would be great.

@paleozogt sorry I haven't gotten back to this sooner, I've been otherwise occupied. I was actually going to suggest perhaps trying Azure CI/CD. It seems they have the trifecta of Mac/Linux/Windows environments, and I trust them to stick around more than CirrusCI (which I have never heard of...).

As for an optimal workflow, I'm fine with manually extracting the bindings from the clang root. It only needs to happen once or twice a year or so. It might make sense to have a separate distribution get installed for the libraries, that way it can be added as an extras_require like @noahp suggested. It will require the binaries to be built and served somewhere, but since builds happen so rarely I doubt it will be a hassle.

@noahp Sadly CirrusCI lacks artifacts. :(

@ethanhs I dug into Azure CI a bit, and it seems they have the full venn diagram: (a) free for open source (b) win/mac/linux with docker and (c) artifact saving. I'm going to start moving over to it.

ethanhs / clang

wheels with bundled libclang #2