Open pitrou opened 2 months ago
cc @kou @raulcd @assignUser
It's a bit tricky to do this within Docker but should be doable. There is a similar issue open for java-jars, cc @danepitkin. We are already using vcpkg binary caching for the macOS jobs in the "cache vcpkg" step (see https://github.com/apache/arrow/pull/43438 and #43434).
FWIW, I got vcpkg caching working within cibuildwheel here:
It mostly seems to depend on passing the GitHub Actions cache environment variables into the container, and also setting VCPKG_BINARY_SOURCES="clear;x-gha,readwrite", I suppose.
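For illustration, a rough sketch of what that could look like, assuming cibuildwheel's CIBW_ENVIRONMENT and CIBW_ENVIRONMENT_PASS_LINUX options and that ACTIONS_CACHE_URL / ACTIONS_RUNTIME_TOKEN are already exported to the job environment (the Microsoft page linked in the issue description shows how to export them with actions/github-script); none of this is copied from the setup referenced above:

```yaml
# Hypothetical excerpt of a manylinux wheel job using the cibuildwheel action
- name: Build wheels
  uses: pypa/cibuildwheel@v2.19.2
  env:
    # Tell vcpkg inside the build container to use the GitHub Actions cache backend
    CIBW_ENVIRONMENT: 'VCPKG_BINARY_SOURCES="clear;x-gha,readwrite"'
    # Forward the runner's cache endpoint and token into the Linux build container
    CIBW_ENVIRONMENT_PASS_LINUX: ACTIONS_CACHE_URL ACTIONS_RUNTIME_TOKEN
```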
By the way, it seems other sources of binary artifacts are supported: https://github.com/microsoft/vcpkg-docs/blob/main/vcpkg/reference/binarycaching.md#configuration-syntax
It's a bit tricky to do this within docker but should be doable
Also, CIBW_CONTAINER_ENGINE: "docker; create_args: --network=host" might help vcpkg access a cache external to the container. This is useful in other CIs, but I haven't needed it in GHA.
especially as we always build the same dependency versions regardless of the targeted Python version.
We could also consider building all wheels for the various Python versions in a single build (which is what typically happens when using e.g. cibuildwheel). It would make a single build longer of course, but reduce the overall CI time.
Now, our pyarrow build and test runs take quite a while, so maybe this would get too long for a single build.
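As a side note, cibuildwheel selects the target Python versions via CIBW_BUILD, so a single job can produce all of them; a purely illustrative selection (not Arrow's actual matrix):

```yaml
env:
  # Build several CPython versions in one job instead of one job per version
  CIBW_BUILD: "cp310-manylinux_x86_64 cp311-manylinux_x86_64 cp312-manylinux_x86_64"
```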
It's quite bad for developer productivity to make the wheel build slower. I would rather we make the existing builds faster.
Currently, when the vcpkg step runs, a manylinux wheel build run takes 1h15. When the vcpkg step is cached in the Docker image, a manylinux wheel build run takes 20 minutes. vcpkg binary caching would hopefully achieve similar results (probably not as good, but still).
a manylinux wheel build run takes 20 minutes
Of which half is setting up the image and building Arrow C++, which, strictly speaking, also does not need to be repeated for every Python version.
But yes, if a build fails for a specific Python version, it would be annoying that they are all combined and that you couldn't easily re-trigger a single Python version (it's always a trade-off).
Building Arrow C++ (and perhaps PyArrow) could be made faster using ccache/sccache. Apparently that's not done currently: https://github.com/apache/arrow/blob/f545b90748d5196af547abcec19d63a7b14e4daa/dev/tasks/python-wheels/github.linux.yml
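For illustration, a rough sketch of what wiring sccache into such a job could look like, assuming the GitHub Actions cache backend and that CMake honours the launcher environment variables (this does not reflect the current Arrow build scripts, and the build invocation is a placeholder):

```yaml
- name: Install sccache
  uses: mozilla-actions/sccache-action@v0.0.5
- name: Build Arrow C++
  env:
    # Assumed wiring: have CMake invoke the compilers through sccache and
    # use sccache's GitHub Actions cache backend
    CMAKE_C_COMPILER_LAUNCHER: sccache
    CMAKE_CXX_COMPILER_LAUNCHER: sccache
    SCCACHE_GHA_ENABLED: "true"
  run: |
    # placeholder for the existing Arrow C++ build invocation
    echo "run the usual cmake configure/build here"
```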
We could consolidate all wheels into a single workflow (or one per OS), where Arrow C++ and the bundled dependencies are built once with the best possible caching and the artifacts are distributed to multiple wheel build jobs. I think this would be the best compromise between overall runtime and efficient use of CI time.
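Purely as a sketch of that shape (job names, paths and the artifact handoff are assumptions, not an actual proposal for the existing crossbow tasks):

```yaml
jobs:
  build-cpp:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build Arrow C++ and bundled deps once
        run: echo "build with vcpkg binary caching + sccache here"
      - uses: actions/upload-artifact@v4
        with:
          name: arrow-cpp-build      # hypothetical artifact name
          path: cpp/build/dist/      # hypothetical install prefix

  build-wheels:
    needs: build-cpp
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python: ["3.10", "3.11", "3.12"]   # illustrative version list
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: arrow-cpp-build
          path: cpp/build/dist/
      - name: Build pyarrow wheel for Python ${{ matrix.python }}
        run: echo "build only the Python bindings against the prebuilt C++"
```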
a build failing for a specific Python version
Thinking back on the last few releases, I think the wheel jobs mostly all fail together rather than a specific job failing on its own, excluding maybe RCs of new Python versions.
We could consolidate all wheels into a single workflow (or one per OS), where Arrow C++ and the bundled dependencies are built once with the best possible caching and the artifacts are distributed to multiple wheel build jobs. I think this would be the best compromise between overall runtime and efficient use of CI time.
That would also make local reproduction using archery docker ... more difficult, unless there's a way to automate that too.
Also, perhaps there could be several sources: a GHA one and a file-based one as a fallback (if not running on GHA?). Something like VCPKG_BINARY_SOURCES=clear;x-gha,rw;files,/vcpkg-cache,rw, with /vcpkg-cache being mapped as a Docker volume to the host's ~/.cache/vcpkg?
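A sketch of the volume side of that idea in docker-compose terms (the service name and host path are illustrative, and the documented readwrite spelling is used for the permissions):

```yaml
services:
  python-wheel-manylinux:            # illustrative service name
    environment:
      # GHA cache first, shared file cache as a fallback outside of GHA
      VCPKG_BINARY_SOURCES: "clear;x-gha,readwrite;files,/vcpkg-cache,readwrite"
    volumes:
      # map the in-container cache path to the host's vcpkg cache directory
      - ~/.cache/vcpkg:/vcpkg-cache
```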
Describe the enhancement requested
We use vcpkg to build bundled dependencies for Python wheels. Unfortunately, it often happens that the Docker image gets rebuilt, and therefore all the dependencies are recompiled from scratch. This makes build times very long (random example here).
It would be nice to use a vcpkg binary cache on CI here, especially as we always build the same dependency versions regardless of the targeted Python version. There are examples here: https://learn.microsoft.com/en-us/vcpkg/consume/binary-caching-github-actions-cache
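For context, the pattern described on that page boils down to exporting the runner's cache endpoint and token and pointing vcpkg at the x-gha backend, roughly like this (where exactly this would hook into our jobs is the open question):

```yaml
- name: Export GitHub Actions cache variables for vcpkg
  uses: actions/github-script@v7
  with:
    script: |
      core.exportVariable('ACTIONS_CACHE_URL', process.env.ACTIONS_CACHE_URL || '');
      core.exportVariable('ACTIONS_RUNTIME_TOKEN', process.env.ACTIONS_RUNTIME_TOKEN || '');
- name: Build bundled dependencies with vcpkg
  env:
    # read and write binary packages to the GitHub Actions cache
    VCPKG_BINARY_SOURCES: "clear;x-gha,readwrite"
  run: echo "invoke the existing vcpkg build step here"
```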
Component(s)
C++, Continuous Integration, Python