apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.91k stars 3.38k forks

[CI] Use out-of-source build for all languages in Docker build #41429

Open kou opened 2 months ago

kou commented 2 months ago

Describe the enhancement requested

If we use in-source builds, we end up with files owned by root in the source tree on the host, because we run as root in Docker containers.

We should use out-of-source build to avoid creating files in source tree on host.

At least python/, js/ and java/ use in-source builds.

Component(s)

Continuous Integration

jorisvandenbossche commented 1 month ago

Why is an out-of-source build needed for this? Is it only relevant for generated artifacts that we want to move out of the Docker image later, such as documentation artifacts? In that case, another solution could be to ensure that only those artifacts are generated outside of the source tree.

kou commented 1 month ago

Oh, sorry. I had a typo in the description:

-We should use out-of-source build to create files in source tree on host.
+We should use out-of-source build to avoid creating files in source tree on host.

It's for avoiding creating files in source tree on host. If files are created in the source tree from inside a Docker container, they end up as root-owned files on the host. They can't be removed by a normal user. It may break a build on host.
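
A minimal sketch of how such leftover files could be spotted on the host, assuming a POSIX `find`. The helper name `list_foreign_files` is hypothetical and is not part of the Arrow scripts:

```shell
# Hypothetical helper: list entries under a directory that are not owned
# by the current user (for example, files left behind by a root container).
list_foreign_files() {
  local dir="$1"
  find "$dir" ! -user "$(id -u)" -print
}

# Example (not executed here; the path is a placeholder):
#   list_foreign_files ~/arrow/python
```

An empty result means every entry in the tree belongs to the invoking user.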

raulcd commented 1 month ago

For Python dev versions, where we extract the version based on the git describe command, it gets rather annoying to do an out-of-source build. We might be able to map the uid:gid of the local user into the Docker container, so that files are created as a non-root user on the host, instead of doing out-of-source builds for everything.
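
A minimal sketch of the uid:gid mapping idea, assuming plain `docker run` rather than archery. `--user` is a standard `docker run` option; the helper name `docker_user_args`, the image name `example-image` and the mount path `/arrow` are placeholders, and how archery would pass this through is not covered here:

```shell
# Hypothetical helper: print the docker option that makes the container
# process run with the invoking user's numeric uid and gid.
docker_user_args() {
  printf -- '--user %s:%s\n' "$(id -u)" "$(id -g)"
}

# Example invocation (not executed here; image and mount path are placeholders):
#   docker run --rm $(docker_user_args) -v "$PWD":/arrow example-image ls -ln /arrow
```

With this option, files written to the bind mount are owned by the host user. One caveat: that uid usually has no passwd entry or home directory inside the image, so tools that rely on `$HOME` may need extra setup.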

jorisvandenbossche commented 1 month ago

It's for avoiding creating files in source tree on host.

I understood that. But my question is still: why is that needed in practice (except for artifacts like docs)? You mention "They can't be removed by a normal user. It may break a build on host.", but did we have such issues in the past? (it has been done in-source forever)

As Raúl mentions, this is quite annoying for the python build, which assumes that it is either in the git repo or built from an sdist that has the version encoded in its files (so not from a plain copy of the sources).

kou commented 1 month ago

I can't remember the details, but I had some problems when I used python/ in-source on the host. (I think I used sudo rm ... or something in that case, but I may be wrong. I can't remember...) (I mix archery docker run ... (for debugging CI failures) with python3 setup.py ... / python3 -m pip ... on the host, but others may not mix them.)

We can map uid:gid, but is there a portable way to do it? I hope that it can be enabled by default.

#41041 has the git describe related problem, right?

Can we use GIT_DIR for it?

diff --git a/ci/scripts/python_build.sh b/ci/scripts/python_build.sh
index 9455baf353..80fd417644 100755
--- a/ci/scripts/python_build.sh
+++ b/ci/scripts/python_build.sh
@@ -25,6 +25,8 @@ build_dir=${2}
 source_dir=${arrow_dir}/python
 python_build_dir=${build_dir}/python

+export GIT_DIR=${arrow_dir}
+
 : ${BUILD_DOCS_PYTHON:=OFF}

 if [ -x "$(command -v git)" ]; then
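
For reference, Git documents `GIT_DIR` as the path to the repository directory itself (normally `<checkout>/.git`), and `GIT_WORK_TREE` as the root of the working tree. A minimal sketch under that reading, assuming a normal (non-bare, non-worktree) checkout. The helper name `git_env_for` is hypothetical, and how setuptools_scm behaves with these variables set has not been checked here:

```shell
# Hypothetical helper: print the two assignments for a normal checkout
# located at the given directory.
git_env_for() {
  local checkout="$1"
  printf 'GIT_DIR=%s/.git\n' "$checkout"
  printf 'GIT_WORK_TREE=%s\n' "$checkout"
}

# Example use from a build directory outside the checkout (not executed here;
# relies on word splitting, so the path must not contain spaces):
#   env $(git_env_for /arrow) git describe --tags
```

Setting `GIT_WORK_TREE` alongside `GIT_DIR` matters for commands that inspect the working tree (for example `git describe --dirty`), because with only `GIT_DIR` set Git treats the current directory as the top of the working tree.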

If we can remove --no-build-isolation from https://github.com/apache/arrow/blob/8169d6e719453acd0e7ca1b6f784d800cca4f113/ci/scripts/python_build.sh#L88-L92, we can remove https://github.com/apache/arrow/blob/8169d6e719453acd0e7ca1b6f784d800cca4f113/ci/scripts/python_build.sh#L81-L86. Can we remove --no-build-isolation with #41041?

jorisvandenbossche commented 1 month ago

Yeah, I don't use our docker builds very often locally, so can't say much about that.

If we can remove --no-build-isolation

I would think that the build isolation should not matter for whether files are generated in the source or not (this is about whether a temporary python venv is created, or whether your current python session is used, while building), although exactly what pip/setuptools do depending on certain flags passed can be quite difficult to guess.

But, I think it should be possible to specify to pip to use a build directory that lives outside of the source (without copying the full source itself), maybe that might help? I think by default pip will create a build directory in python/build (https://github.com/pypa/pip/issues/10695)

jorisvandenbossche commented 1 month ago

Looking a bit further into it, pip was actually defaulting to an "out-of-source" build in the past, and only switched to in-tree builds by default in the last two years (https://pip.pypa.io/en/stable/topics/local-project-installs/#build-artifacts). So indeed, it now does an in-tree build and doesn't allow specifying a build directory; that is the responsibility of the build backend (setuptools), AFAIU. And from reading some issues related to this (e.g. https://github.com/pypa/build/issues/446, https://github.com/pypa/setuptools/issues/1816), it seems this is not easily configurable.

So in short, if we want to have the same out-of-source build as we had with older pip, it seems that we indeed need to do that manually ourselves.
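
A minimal sketch of doing that manually, assuming the approach is to copy the Python sources into a scratch directory and build from the copy. `copy_for_build` is a hypothetical helper, and the paths and the pip command in the trailing comment are only an illustration:

```shell
# Hypothetical helper: copy a source directory into a fresh build directory
# so that build outputs land in the copy rather than in the source tree.
copy_for_build() {
  local source_dir="$1"
  local build_dir="$2"
  rm -rf "$build_dir"                 # start from a clean copy
  mkdir -p "$(dirname "$build_dir")"  # make sure the parent exists
  cp -a "$source_dir" "$build_dir"    # build_dir itself is created by cp
}

# Example (not executed here; paths are placeholders):
#   copy_for_build /arrow/python /build/python
#   (cd /build/python && python3 -m pip install .)
```

The copy contains no `.git` directory, which is where the git describe problem mentioned earlier in this thread comes from.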

kou commented 1 month ago

Thanks for looking into it. I see.