apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[CI][Python] Improve vcpkg caching #43951

Open pitrou opened 2 months ago

pitrou commented 2 months ago

Describe the enhancement requested

We use vcpkg to build bundled dependencies for Python wheels. Unfortunately, it often happens that the Docker image gets rebuilt, and therefore all the dependencies are recompiled from scratch. This makes build times very long (random example here).

It would be nice to use a vcpkg binary cache on CI here, especially as we always build the same dependency versions regardless of the targeted Python version. There are examples here: https://learn.microsoft.com/en-us/vcpkg/consume/binary-caching-github-actions-cache
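Following the Microsoft guide linked above, a minimal sketch of what enabling the x-gha backend might look like (the `actions/github-script` step and variable names come from that guide, not from Arrow's CI):

```shell
# Sketch only: in the workflow, the Actions cache endpoint and token are
# first exported so vcpkg can reach the GitHub Actions cache backend,
# typically via a small actions/github-script step:
#
#   core.exportVariable('ACTIONS_CACHE_URL', process.env.ACTIONS_CACHE_URL || '');
#   core.exportVariable('ACTIONS_RUNTIME_TOKEN', process.env.ACTIONS_RUNTIME_TOKEN || '');

# vcpkg is then pointed at the x-gha backend, read-write:
export VCPKG_BINARY_SOURCES="clear;x-gha,readwrite"

# Subsequent `vcpkg install` invocations restore prebuilt binaries on a
# cache hit instead of compiling each port from source.
```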

Component(s)

C++, Continuous Integration, Python

pitrou commented 2 months ago

cc @kou @raulcd @assignUser

assignUser commented 2 months ago

It's a bit tricky to do this within Docker, but it should be doable; there is a similar issue open for the java-jars, cc @danepitkin. We are using vcpkg binary caching for the macOS jobs ("cache vcpkg"; see https://github.com/apache/arrow/pull/43438 and #43434).

sjperkins commented 2 months ago

FWIW, I got vcpkg caching working within cibuildwheel here:

https://github.com/ratt-ru/arcae/blob/cd3e7e8f7057a66aad7fedf7e4adf18334fbf2c9/.github/workflows/ci.yml#L157-L194

It mostly seems to depend on passing the relevant environment variables into the container.

pitrou commented 2 months ago

And also VCPKG_BINARY_SOURCES="clear;x-gha,readwrite" I suppose.
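A hedged sketch of how those variables might be forwarded into the cibuildwheel container; CIBW_ENVIRONMENT_PASS_LINUX is a real cibuildwheel option, but the exact variable list here is an assumption based on the vcpkg docs rather than taken from the linked workflow:

```shell
# Assumed variable set; adjust to whatever the workflow actually exports.
export VCPKG_BINARY_SOURCES="clear;x-gha,readwrite"

# Tell cibuildwheel to pass these host variables through to the
# manylinux build container:
export CIBW_ENVIRONMENT_PASS_LINUX="VCPKG_BINARY_SOURCES ACTIONS_CACHE_URL ACTIONS_RUNTIME_TOKEN"

# cibuildwheel --platform linux   # vcpkg inside the container now sees the cache
```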

pitrou commented 2 months ago

By the way, it seems other sources of binary artifacts are supported: https://github.com/microsoft/vcpkg-docs/blob/main/vcpkg/reference/binarycaching.md#configuration-syntax
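For reference, a few of the provider strings from that page, with placeholder paths and URLs (these are illustrations of the syntax, not suggestions for Arrow's actual config):

```shell
# Filesystem provider: store prebuilt packages under a local directory.
export VCPKG_BINARY_SOURCES="clear;files,/some/path,readwrite"

# Other providers documented on the same page (placeholders, commented out):
# export VCPKG_BINARY_SOURCES="clear;nuget,https://example.org/nuget-feed,readwrite"
# export VCPKG_BINARY_SOURCES="clear;http,https://example.org/cache/{sha},readwrite"
```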

sjperkins commented 2 months ago

It's a bit tricky to do this within docker but should be doable

Also, CIBW_CONTAINER_ENGINE: "docker; create_args: --network=host" might help vcpkg access a cache external to the container. This is useful on other CIs, but I haven't needed it on GHA.

jorisvandenbossche commented 2 months ago

especially as we always build the same dependency versions regardless of the targeted Python version.

We could also consider building all wheels for the various Python versions in a single build (which is what typically happens when using e.g. cibuildwheel). It would make a single build longer, of course, but reduce the overall CI time.

Now, our pyarrow build and test runs take quite a while, so maybe this would get too long for a single build.

pitrou commented 2 months ago

It's quite bad for developer productivity to make the wheel build slower. I would rather we make the existing builds faster.

Currently, when the vcpkg step runs, a manylinux wheel build run takes 1h15. When the vcpkg step is cached in the Docker image, a manylinux wheel build run takes 20 minutes. vcpkg binary caching would hopefully achieve similar results (probably not as good, but still).

jorisvandenbossche commented 2 months ago

a manylinux wheel build run takes 20 minutes

Of which half is setting up the image and building Arrow C++, which also, strictly speaking, does not need to be repeated for every Python version.

But yes, if a build fails for a specific Python version, it would be annoying that they are all combined and you couldn't easily retrigger a single Python version (it's always a trade-off).

pitrou commented 2 months ago

Building Arrow C++ (and perhaps PyArrow) could be made faster using ccache/sccache. Apparently that's not the case currently: https://github.com/apache/arrow/blob/f545b90748d5196af547abcec19d63a7b14e4daa/dev/tasks/python-wheels/github.linux.yml
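As a hedged illustration of the ccache suggestion (paths are illustrative, not Arrow's actual CI configuration), CMake can be pointed at ccache via its compiler-launcher variables:

```shell
# Hedged sketch: CMake (>= 3.17) honors these environment variables and
# prefixes every compiler invocation with ccache, so unchanged object
# files are reused across rebuilds instead of being recompiled.
export CMAKE_C_COMPILER_LAUNCHER=ccache
export CMAKE_CXX_COMPILER_LAUNCHER=ccache
export CCACHE_DIR="$HOME/.ccache"

# ccache --max-size=5G                           # cap the on-disk cache
# cmake -S cpp -B build && cmake --build build   # second run hits the cache
```

On CI, CCACHE_DIR would additionally need to be saved/restored between runs (e.g. with actions/cache) for the cache to survive across workflow runs.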

assignUser commented 2 months ago

We could consolidate all wheels into a single workflow (or one per OS), where Arrow C++ and the dependencies are built once with the best possible caching and the artifacts are distributed to multiple wheel-build jobs. I think this would be the best compromise between overall runtime and efficient use of CI time.

assignUser commented 2 months ago

a build failing for a specific Python version

Thinking back on the last few releases, I think all the wheel jobs mostly fail together rather than a specific one having issues, excluding maybe RCs of new Python versions.

pitrou commented 2 months ago

We could consolidate all wheels into a single workflow (or one per OS), where Arrow C++ and the dependencies are built once with the best possible caching and the artifacts are distributed to multiple wheel-build jobs. I think this would be the best compromise between overall runtime and efficient use of CI time.

That would also make local reproduction using archery docker ... more difficult, unless there's a way to automate that too.

pitrou commented 1 month ago

Also, perhaps there could be several sources: a GHA one and a file-based one as a fallback (when not running on GHA?).

Something like: VCPKG_BINARY_SOURCES=clear;x-gha,readwrite;files,/vcpkg-cache,readwrite with /vcpkg-cache being mapped as a Docker volume to the host's ~/.cache/vcpkg?
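A hedged sketch of that fallback combination (paths and the image name are placeholders): vcpkg consults the listed providers in order, so the GHA cache is tried first and the mounted directory serves as the file-based fallback:

```shell
# Host side: make sure the cache directory exists before mounting it.
mkdir -p "$HOME/.cache/vcpkg"

# GHA cache first, file-based cache at /vcpkg-cache as fallback:
export VCPKG_BINARY_SOURCES="clear;x-gha,readwrite;files,/vcpkg-cache,readwrite"

# Inside the wheel-build container, /vcpkg-cache would be the mounted
# host directory, e.g. (image name is a placeholder):
# docker run \
#   -v "$HOME/.cache/vcpkg:/vcpkg-cache" \
#   -e VCPKG_BINARY_SOURCES -e ACTIONS_CACHE_URL -e ACTIONS_RUNTIME_TOKEN \
#   some-wheel-build-image ...
```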