conda-forge / arrow-cpp-feedstock

A conda-smithy repository for arrow-cpp.
BSD 3-Clause "New" or "Revised" License

cross-compiled CUDA builds running out of disk space #1114

Closed h-vetinari closed 1 year ago

h-vetinari commented 1 year ago

I don't know what change caused this (perhaps something in the CUDA setup...), but for about a month now, cross-compiling CUDA has consistently blown through the disk space of the Azure workers, failing the job in an often un-restartable way.

I've tried fixing this in various ways (#1075, #1081, 7c267128a11af2d62855898d2812297fd9ecacc0, a8ca8f793c920dad4dfb2f377a53ab886c8603d3, 555a42cdd11c50a3c7ac5d105248e62d91982525). The problem exists on both ppc & aarch; for ppc at least, the various fixes seem to have mostly settled things, but for aarch it's still failing 9 times out of 10.

(Note: the only reason I disabled google-cloud-cpp is that jobs started failing again when migrating to a new version that had some more features enabled and a footprint of around 40MB; this is being tackled in https://github.com/conda-forge/google-cloud-cpp-feedstock/issues/141.)

CC @conda-forge/cuda-compiler @jakirkham @isuruf

PS. By chance I saw that qt also has disk space problems and worked around them by disabling the use of precompiled headers. Arrow has an option ARROW_USE_PRECOMPILED_HEADERS, but it's already off by default.
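
For reference, forcing that option off explicitly would just mean passing it at configure time; an illustrative invocation (not the feedstock's actual build script, and the source path is just an example) would be:

$ # sketch only; the real flags live in the feedstock's build scripts
$ cmake -S arrow/cpp -B build -DARROW_USE_PRECOMPILED_HEADERS=OFF -DCMAKE_BUILD_TYPE=Release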

h-vetinari commented 1 year ago

However, it does kinda seem related to PCHs, in that the failure looks like:

[73/134] Building CXX object CMakeFiles/gandiva.dir/cmake_pch.hxx.gch
FAILED: CMakeFiles/gandiva.dir/cmake_pch.hxx.gch 
$BUILD_PREFIX/bin/aarch64-conda-linux-gnu-c++ -Dgandiva_EXPORTS -I$SRC_DIR/python/pyarrow/src -I$SRC_DIR/python/build/temp.linux-aarch64-cpython-311/pyarrow/src -isystem $PREFIX/include/python3.11 -isystem /home/conda/feedstock_root/build_artifacts/apache-arrow_1688985855110/_build_env/venv/lib/python3.11/site-packages/numpy/core/include -Wno-noexcept-type  -Wall -fno-semantic-interposition -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -isystem $PREFIX/include -fdebug-prefix-map=$SRC_DIR=/usr/local/src/conda/pyarrow-12.0.1 -fdebug-prefix-map=$PREFIX=/usr/local/src/conda-prefix -isystem /usr/local/cuda/targets/sbsa-linux/include -fdiagnostics-color=always  -fno-omit-frame-pointer -Wno-unused-variable -Wno-maybe-uninitialized -O3 -DNDEBUG -O2 -ftree-vectorize -std=c++17 -fPIC -Winvalid-pch -x c++-header -include $SRC_DIR/python/build/temp.linux-aarch64-cpython-311/CMakeFiles/gandiva.dir/cmake_pch.hxx -MD -MT CMakeFiles/gandiva.dir/cmake_pch.hxx.gch -MF CMakeFiles/gandiva.dir/cmake_pch.hxx.gch.d -o CMakeFiles/gandiva.dir/cmake_pch.hxx.gch -c $SRC_DIR/python/build/temp.linux-aarch64-cpython-311/CMakeFiles/gandiva.dir/cmake_pch.hxx.cxx
In file included from $SRC_DIR/python/pyarrow/src/arrow/python/platform.h:28,
                 from $SRC_DIR/python/pyarrow/src/arrow/python/pch.h:24,
                 from $SRC_DIR/python/build/temp.linux-aarch64-cpython-311/CMakeFiles/gandiva.dir/cmake_pch.hxx:5,
                 from <command-line>:
$PREFIX/include/python3.11/datetime.h:264:1: fatal error: cannot write PCH file: No space left on device
  264 | }
      | ^
compilation terminated.

CC @kou @pitrou

h-vetinari commented 1 year ago

This is now permanently blowing up our cross-compiled CUDA builds (both aarch & PPC) on 12.x & 11.x. On 10.x at least, the build passes (with the ~same fixes as mentioned in the OP, in particular with google-cloud-cpp disabled).

isuruf commented 1 year ago

Arrow has an option ARROW_USE_PRECOMPILED_HEADERS, but it's already off by default.

You are looking at arrow C++ sources, but the error is in pyarrow.

jakirkham commented 1 year ago

My hunch is that something has changed about the Azure images, which causes the amount of stuff included in them to increase (not exactly sure what changed)

Have seen this in a couple other cross-compilation CUDA builds. Though I think that is coincidental as there is simply more stuff being downloaded in those cases. Have seen disk space issues in at least one job that doesn't do any cross-compilation (though is CUDA related)

Have poked around a little bit with du & tree in this PR ( https://github.com/conda-forge/cudatoolkit-feedstock/pull/93 ), but haven't had a lot of time to do it (and haven't yet found anything that would be easy to remove). Though maybe that is a good starting point for anyone wanting to investigate this further
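
Roughly the kind of thing I was poking at (paths are just examples, not an exact transcript):

$ df -h /                                        # how much headroom the agent has left
$ du -xh / 2>/dev/null | sort -rh | head -n 25   # largest directories on the image
$ du -sh /usr/local/cuda* /opt/* 2>/dev/null     # usual suspects for preinstalled toolchains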

h-vetinari commented 1 year ago

You are looking at arrow C++ sources, but the error is in pyarrow.

I was just collecting potentially related information; that particular option is already off by default anyway, so it wasn't a serious candidate.

h-vetinari commented 1 year ago

Have seen this in a couple other cross-compilation CUDA builds. Though I think that is coincidental as there is simply more stuff being downloaded in those cases.

Yeah, the cross-compilation infra for CUDA 11 needs to download and unpack a bunch of artefacts (see https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/210). Would it make sense to try to move these builds to CUDA 12? Having any builds restricted to CUDA >=12 would still be better than having no builds at all.

h-vetinari commented 1 year ago

Would it make sense to try to move these builds to CUDA 12? Having any builds restricted to CUDA >=12 would still be better than having no builds at all.

Giving this a shot in #1120

jakirkham commented 1 year ago

Sure that seems like a reasonable approach 👍

Happy to look over things there if you need another pair of eyes 🙂

isuruf commented 1 year ago

I was just collecting potentially related information; that particular option is already off by default anyway, so it wasn't a serious candidate.

If you look at pyarrow sources, you'll see that it's not 'already off by default anyway'.

h-vetinari commented 1 year ago

If you look at pyarrow sources, you'll see that it's not 'already off by default anyway'.

Can you be more specific about what you're referring to? I gave a direct link to an option that's off by default (I didn't claim it applied to pyarrow either...). In pyarrow, I don't find anything using the substring PRECOMPILED_HEADER; and the only occurrence for pch does not have a switch.

isuruf commented 1 year ago

and the only occurrence for pch does not have a switch.

Exactly. Turn that off with a patch, and this issue will probably go away.
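
E.g. something along these lines (sketch only; the exact file and location may differ):

$ grep -rni "pch" python/   # presumably finds an unconditional target_precompile_headers(...) call
$ # a patch would simply drop that call (or guard it behind an option)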

h-vetinari commented 1 year ago

That falls under the category "not obvious to me" - I can't tell if things are still expected to work without this (given that there's no option to toggle), and I'm not in the habit of patching out things I don't understand (for example, I'm confused why headers -- something pretty lightweight -- would blow through the disk space).

But I'm happy to try it, thanks for the pointer.

isuruf commented 1 year ago

Precompiled headers are not lightweight. They are heavy.

$ cat pch.h
#include <stdio.h>
$ g++ pch.h -o pch.h.gch
$ file pch.h.gch 
pch.h.gch: GCC precompiled header (version 014) for C++
$ ls -alh pch.h.gch
-rw-rw-r-- 1 isuru isuru 2.2M Jul 20 14:49 pch.h.gch

jakirkham commented 1 year ago

Maybe we should ask someone from the Arrow team to chime in?

h-vetinari commented 1 year ago

2.2M

Everything is relative of course, but I don't think 2.2MB will be the reason for us running out of disk-space on the agent.

Maybe we should ask someone from the Arrow team to chime in?

Sure. I think it's more the "fault" of our infra than of arrow itself, but it would be good to check whether removing pyarrow's precompiled headers is viable and what impact it would have. Hoping you could weigh in @kou @pitrou @jorisvandenbossche @assignUser

isuruf commented 1 year ago

Everything is relative of course, but I don't think 2.2MB will be the reason for us running out of disk-space on the agent.

That's just a simple C header generating 2.2MB. Template-heavy C++ headers can go up to several GB.
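
Easy to check yourself; e.g. (exact sizes will vary by compiler, flags, and headers):

$ printf '#include <regex>\n#include <random>\n#include <iostream>\n' > heavy.h
$ g++ -std=c++17 heavy.h -o heavy.h.gch
$ ls -alh heavy.h.gch   # already much larger than the stdio.h example above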

h-vetinari commented 1 year ago

OK, thanks, finally I can see why this would be related. I still don't know why it would blow up so hard, but that's something I can investigate later.

assignUser commented 1 year ago

My hunch is that something has changed about the Azure images, which causes the amount of stuff included in them to increase (not exactly sure what changed)

I can add to that suspicion: some of the space-heavy arrow doc builds we run on Azure have recently started failing due to lack of space, and we don't understand why.

Regarding the pch: my understanding is that PCHs are useful to speed up build times on repeated re-builds (e.g. local development), which is not really the case here, if I recall the CI setup correctly (matrix build, so each job only builds once?). So it should be fine to patch that out, but there should probably also be an arrow issue to add an option for PCH in pyarrow? @jorisvandenbossche

h-vetinari commented 1 year ago

Thanks for the info @assignUser!

For now, even patching out the pch didn't work (see #1122); we've also removed some unnecessary caching in our images, to no avail.

I'm now looking at trimming some more fat in our cross-CUDA setup, which I noticed is not deleting the CUDA .rpm files; that amounts to roughly 2GB of stuff.
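
Roughly what I have in mind (illustrative commands only; the actual change goes into the CI setup scripts, and the search pattern is a guess):

$ find / -name "*cuda*.rpm" 2>/dev/null | xargs -r du -ch   # see what's actually lying around
$ find / -name "*cuda*.rpm" -delete 2>/dev/null             # drop the archives once they're unpacked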

jakirkham commented 1 year ago

Indeed thanks for the feedback! 🙏

My hunch is that something has changed about the Azure images, which causes the amount of stuff included in them to increase (not exactly sure what changed)

I can add to that suspicion: some of the space-heavy arrow doc builds we run on Azure have recently started failing due to lack of space, and we don't understand why.

Here are some ideas of things we might remove from the Azure images ( https://github.com/conda-forge/conda-smithy/pull/1747 )