conda-forge / conda-forge-ci-setup-feedstock

A conda-smithy repository for conda-forge-ci-setup.
BSD 3-Clause "New" or "Revised" License

cross compile cuda support (cnt'd) #210

Closed · h-vetinari closed 1 year ago

h-vetinari commented 1 year ago

New PR since I cannot commit into #209

Closes #209

conda-forge-linter commented 1 year ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

h-vetinari commented 1 year ago

Thanks for the quick review @kkraus14! Do you know whether we need the other devel packages?

kkraus14 commented 1 year ago

Thanks for the quick review @kkraus14! Do you know whether we need the other devel packages?

Yes, we'll need those. They contain the headers, the generic unversioned symlinked shared libraries, the static libraries (which we may want specifically for CUDA, because there are some implications there, IIRC), and the pkgconfig files.

h-vetinari commented 1 year ago

Argh, the test here cannot work because at test time, the "BUILD_PLATFORM" is already emulated.

Yes we'll need those.

Added in https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/210/commits/91610af356278676b9ee4dabc5afbc14600e1714; I'm not sure my scripting is worth a damn, as I'm not experienced with jq, but I at least tested it in https://jqplay.org/, and of course bash errors are a real possibility too. 😅
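For concreteness, a minimal jq sketch of the kind of extraction that script attempts (the manifest snippet below is a made-up stand-in for the real redistrib JSON, and the filter is an illustration, not the actual code in the commit):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Made-up stand-in for NVIDIA's redistrib manifest: package entries are
# objects with a "version" field; top-level metadata keys are not.
cat > manifest.json <<'EOF'
{
  "release_date": "2021-03-09",
  "cuda_nvcc": { "name": "CUDA NVCC", "version": "11.2.152" },
  "libcublas": { "name": "CUDA cuBLAS", "version": "11.4.1.1043" }
}
EOF

# Emit one "<package>:<version>" line per real package entry,
# skipping metadata keys that have no "version" field.
jq -r 'to_entries[]
       | select((.value | type) == "object" and .value.version != null)
       | "\(.key):\(.value.version)"' manifest.json
```

On the snippet above this prints `cuda_nvcc:11.2.152` and `libcublas:11.4.1.1043`; the real manifest of course has many more entries and further quirks.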

h-vetinari commented 1 year ago

This is way more complicated and downloads more things than needed.

Fair enough, I was just following the instructions to download packages based on the manifest. I started by downloading everything (plus what Keith told me to add), but if we can filter/shorten the list, all the better.

kkraus14 commented 1 year ago

This is way more complicated and downloads more things than needed. The packages I had in my PR were the only ones in $CUDA_HOME/targets/sbsa-linux which is what the cross compiler looks at. All the other libraries are irrelevant.

I think there's a handful of other libraries that end up in the sysroot (I think?) which don't have stubs, that we'd also want but which weren't included in your PR. I did my best to break them down in my comment in the thread.

h-vetinari commented 1 year ago

@isuruf: This is way more complicated and downloads more things than needed.

@kkraus14: There's other things that would potentially be needed [...]

I'll let you two decide what should be included.

My goal here was just to make it possible to use the versions from the manifest. As feared, this turned into a fair amount of code, due to various small and not-so-small divergences between the manifest and the actual RPMs. It's conceivable that someone might prefer to just hardcode the list of deps, but now that it works (with the ability to map in additional RPMs, and the obvious possibility of filtering out unwanted packages), I tend to think it's probably a good thing going forward.

@isuruf, I managed to get this to run through to the point where you had added

mv ./usr/local/cuda-${CUDA_COMPILER_VERSION}/targets/${CUDA_HOST_PLATFORM_ARCH}-linux ${CUDA_HOME}/targets/${CUDA_HOST_PLATFORM_ARCH}-linux

at the end of the loop, but this fails with

mv: cannot move ‘./usr/local/cuda-11.2/targets/sbsa-linux’ to ‘/usr/local/cuda/targets/sbsa-linux’: Permission denied

h-vetinari commented 1 year ago

@isuruf @kkraus14 Can we iterate on the list of things to install?

At the moment the list is as follows:

```
{
  "cuda": { "name": "CUDA SDK", "version": "11.2.2" },
  "cuda_cudart": { "name": "CUDA Runtime (cudart)", "version": "11.2.152" },
  "cuda_cuobjdump": { "name": "cuobjdump", "version": "11.2.152" },
  "cuda_cupti": { "name": "CUPTI", "version": "11.2.152" },
  "cuda_cuxxfilt": { "name": "CUDA cu++ filt", "version": "11.2.152" },
  "cuda_gdb": { "name": "CUDA GDB", "version": "11.2.152" },
  "cuda_nvcc": { "name": "CUDA NVCC", "version": "11.2.152" },
  "cuda_nvdisasm": { "name": "CUDA nvdisasm", "version": "11.2.152" },
  "cuda_nvml_dev": { "name": "CUDA NVML Headers", "version": "11.2.152" },
  "cuda_nvprof": { "name": "CUDA nvprof", "version": "11.2.152" },
  "cuda_nvprune": { "name": "CUDA nvprune", "version": "11.2.152" },
  "cuda_nvrtc": { "name": "CUDA NVRTC", "version": "11.2.152" },
  "cuda_nvtx": { "name": "CUDA NVTX", "version": "11.2.152" },
  "cuda_samples": { "name": "CUDA Samples", "version": "11.2.152" },
  "cuda_sanitizer_api": { "name": "CUDA Compute Sanitizer API", "version": "11.2.152" },
  "libcublas": { "name": "CUDA cuBLAS", "version": "11.4.1.1043" },
  "libcufft": { "name": "CUDA cuFFT", "version": "10.4.1.152" },
  "libcurand": { "name": "CUDA cuRAND", "version": "10.2.3.152" },
  "libcusolver": { "name": "CUDA cuSOLVER", "version": "11.1.0.152" },
  "libcusparse": { "name": "CUDA cuSPARSE", "version": "11.4.1.1152" },
  "libnpp": { "name": "CUDA NPP", "version": "11.3.2.152" },
  "libnvjpeg": { "name": "CUDA nvJPEG", "version": "11.4.0.152" },
  "nsight_compute": { "name": "Nsight Compute", "version": "2020.3.1.4" },
  "nsight_systems": { "name": "Nsight Systems", "version": "2020.4.3.7" },
  "nvidia_driver": { "name": "NVIDIA Linux Driver", "version": "460.32.03" },
  # ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  # from manifest up until this point;
  # entries below added by mapping
  # vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
  "cuda_cudart_devel": { "version": "11.2.152" },
  "cuda_driver_devel": { "version": "11.2.152" },
  "cuda_nvrtc_devel": { "version": "11.2.152" },
  "libcublas_devel": { "version": "11.4.1.1043" },
  "libcufft_devel": { "version": "10.4.1.152" },
  "libcurand_devel": { "version": "10.2.3.152" },
  "libcusolver_devel": { "version": "11.1.0.152" },
  "libcusparse_devel": { "version": "11.4.1.1152" },
  "libnpp_devel": { "version": "11.3.2.152" },
  "libnvjpeg_devel": { "version": "11.4.0.152" },
  "nvidia_driver_devel": { "version": "460.32.03" }
}
```

h-vetinari commented 1 year ago

More compact list:

cuda:11.2.2
cuda-cudart:11.2.152
cuda-cudart-devel:11.2.152
cuda-cuobjdump:11.2.152
cuda-cupti:11.2.152
cuda-cuxxfilt:11.2.152
cuda-driver-devel:11.2.152
cuda-gdb:11.2.152
cuda-nvcc:11.2.152
cuda-nvdisasm:11.2.152
cuda-nvml-devel:11.2.152
cuda-nvprof:11.2.152
cuda-nvprune:11.2.152
cuda-nvrtc:11.2.152
cuda-nvrtc-devel:11.2.152
cuda-nvtx:11.2.152
cuda-samples:11.2.152
cuda-sanitizer:11.2.152
libcublas:11.4.1.1043
libcublas-devel:11.4.1.1043
libcufft:10.4.1.152
libcufft-devel:10.4.1.152
libcurand:10.2.3.152
libcurand-devel:10.2.3.152
libcusolver:11.1.0.152
libcusolver-devel:11.1.0.152
libcusparse:11.4.1.1152
libcusparse-devel:11.4.1.1152
libnpp:11.3.2.152
libnpp-devel:11.3.2.152
libnvjpeg:11.4.0.152
libnvjpeg-devel:11.4.0.152
nsight-compute:2020.3.1.4
nsight-systems:2020.4.3.7
nvidia-driver:460.32.03
nvidia-driver-devel:460.32.03

h-vetinari commented 1 year ago

@isuruf, I managed to get this to run through to the point where you had added

mv ./usr/local/cuda-${CUDA_COMPILER_VERSION}/targets/${CUDA_HOST_PLATFORM_ARCH}-linux ${CUDA_HOME}/targets/${CUDA_HOST_PLATFORM_ARCH}-linux

at the end of the loop, but this fails with

mv: cannot move ‘./usr/local/cuda-11.2/targets/sbsa-linux’ to ‘/usr/local/cuda/targets/sbsa-linux’: Permission denied

I'm pretty sure now that this happens because I was running this in the recipe section for debugging purposes, where we don't actually have root rights. I think that if this script is run where it should be, it would actually run through. In any case, the fact that it works up to this point shows that the downloads are fine, which means I'm marking this as ready (happy to clean up the commit history if desired).

h-vetinari commented 1 year ago

I'm pretty sure now that this happens because I was running this in the recipe-section for debugging purposes, where we don't actually have root rights. I think that if this script is run where it should, that actually should run through.

This turned out to be wrong... So now I'm not sure what we need to tweak (chown?) to get the packages into $CUDA_HOME. It's a "simple" permission issue, but I don't know well enough which layer here is allowed to do what.

Also worth noting - not everything gets installed in ./usr/local/cuda-11.2; some of it goes into ./opt/nvidia or ./usr/share.
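For what it's worth, one possible shape of a workaround (a sketch under the assumption that the build user has passwordless sudo in the image, which may well not hold here) is to fall back to an elevated move when the plain one fails:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo stand-ins for the real paths (which involve $CUDA_HOME etc.)
mkdir -p demo/extracted/targets/sbsa-linux demo/cuda_home/targets
src="demo/extracted/targets/sbsa-linux"
dst="demo/cuda_home/targets/sbsa-linux"

# Try the unprivileged move first; if the destination is root-owned
# (as /usr/local/cuda typically is), retry with sudo.
if ! mv "$src" "$dst" 2>/dev/null; then
    sudo mv "$src" "$dst"
fi
```

Whether sudo (or a chown of $CUDA_HOME up front) is the right layer to fix this at is exactly the open question here.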

h-vetinari commented 1 year ago

@isuruf @kkraus14 rebased after #211 :)

h-vetinari commented 1 year ago

@isuruf @kkraus14 Is there something I can do to move this forward? There's been a couple of points (see comments above) over the last few days where I'd appreciate your insights/help.

h-vetinari commented 1 year ago

Gentle ping @kkraus14 @isuruf

kkraus14 commented 1 year ago

I'm perfectly happy if we want to reduce it to just the list of packages that @isuruf mentioned to start and then consider expanding as desired from there.

isuruf commented 1 year ago

Please use the list of package names from my original PR. The others cannot be used for cross-compiling, and adding complicated logic here to download them (e.g. the special case for nsight) makes no sense to me.

kkraus14 commented 1 year ago

Please use the list of package names from my original PR. The others cannot be used for cross-compiling, and adding complicated logic here to download them (e.g. the special case for nsight) makes no sense to me.

I apologize. I obviously don't know what I'm doing when it comes to cross compilation 😅

h-vetinari commented 1 year ago

Please use the list of package names from my original PR.

In that PR the instruction was to parse the information from the JSON (it was unclear to me at the time that this wasn't supposed to take all packages from the JSON). Just to be sure we're talking about the same thing now: --> the goal is still to take the versions from the JSON, but filter down to the packages from your PR <--

?

isuruf commented 1 year ago

--> the goal is still to take the versions from the JSON, but filter to the packages from your PR

Yes
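A minimal jq sketch of that filtering (again an illustration with made-up inputs, not the script's actual code): the allowlist of package names is hard-coded, while each version still comes from the manifest:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Made-up stand-in for the redistrib manifest
cat > manifest.json <<'EOF'
{
  "cuda_nvcc":      { "name": "CUDA NVCC",      "version": "11.2.152" },
  "cuda_nvprof":    { "name": "CUDA nvprof",    "version": "11.2.152" },
  "nsight_compute": { "name": "Nsight Compute", "version": "2020.3.1.4" }
}
EOF

# Allowlist of manifest keys to install (illustrative, not the PR's list)
wanted='["cuda_nvcc", "cuda_nvprof"]'

# Package selection comes from the allowlist; versions come from the manifest.
jq -r --argjson wanted "$wanted" '
  to_entries[]
  | select(.key as $k | $wanted | index($k) != null)
  | "\(.key):\(.value.version)"' manifest.json
```

On the snippet above this keeps `cuda_nvcc:11.2.152` and `cuda_nvprof:11.2.152` and drops `nsight_compute`.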

h-vetinari commented 1 year ago

Please use the list of package names from my original PR.

This now matches the list from #209 (modulo the versions of course, which was the point):

+++ echo 'Installing the following packages (<pkg>:<version>)'
Installing the following packages (<pkg>:<version>)
+++ cat rpms_cc.txt
cuda-cudart:11.2.152
cuda-cudart-devel:11.2.152
cuda-cupti:11.2.152
cuda-driver-devel:11.2.152
cuda-nvcc:11.2.152
cuda-nvml-devel:11.2.152
cuda-nvprof:11.2.152
cuda-nvrtc:11.2.152
cuda-nvrtc-devel:11.2.152
cuda-nvtx:11.2.152
libcublas:11.4.1.1043
libcublas-devel:11.4.1.1043
libcufft:10.4.1.152
libcufft-devel:10.4.1.152
libcurand:10.2.3.152
libcurand-devel:10.2.3.152
libcusolver:11.1.0.152
libcusolver-devel:11.1.0.152
libcusparse:11.4.1.1152
libcusparse-devel:11.4.1.1152
libnpp:11.3.2.152
libnpp-devel:11.3.2.152
libnvjpeg:11.4.0.152
libnvjpeg-devel:11.4.0.152

Permission errors for

mv ./usr/local/cuda-${CUDA_COMPILER_VERSION}/targets/${CUDA_HOST_PLATFORM_ARCH}-linux ${CUDA_HOME}/targets/${CUDA_HOST_PLATFORM_ARCH}-linux

are still there of course...

h-vetinari commented 1 year ago

@isuruf I notice that the CI was green in #209, but I believe that's because the script wasn't actually run (at least it wasn't executable). I can of course revert https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/210/commits/19330637b189bf19de0008eb0eebc4eca4b09111, but I'm not sure that would solve things in practice.
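As an aside, if the missing executable bit had been the problem, recording it in git (so that CI checkouts get the mode) is straightforward; a sketch with a hypothetical script name, not the feedstock's actual file:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repo with a hypothetical script, for demonstration
mkdir -p demo_repo && cd demo_repo
git init -q .
printf '#!/bin/bash\necho hi\n' > download_cuda.sh   # hypothetical name
git add download_cuda.sh

# Record the executable bit in the index, independent of the local file mode
git update-index --chmod=+x download_cuda.sh

# The stage entry now shows mode 100755 (executable)
git ls-files --stage download_cuda.sh
```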

h-vetinari commented 1 year ago

OK, the executable bit was not relevant here after all, but I don't get why this runs into permission errors then...

h-vetinari commented 1 year ago

@isuruf Any pointers about the permission errors for moving stuff to $CUDA_HOME?

h-vetinari commented 1 year ago

Gentle ping @isuruf

h-vetinari commented 1 year ago

@isuruf Should we push this over the finish line? Would be good to re-enable aarch+CUDA builds for arrow.

Or do you think we should wait (several months, I'm guessing...?) for https://github.com/conda-forge/staged-recipes/issues/21382 to be finished?

jakirkham commented 1 year ago

The compiler package is already in a staged-recipes PR ( https://github.com/conda-forge/staged-recipes/pull/21350 ) and we are discussing how to support cross-compilation with it. Feedback would be welcome :)

There's no reason it needs to take several months to add the new packages, though I suppose it could take some time. It's more dependent on reviews than anything else. We already have a bunch of recipes ready to go.

isuruf commented 1 year ago

It adds 11.8 though. We need 11.2. Are there plans to add 11.2?

h-vetinari commented 1 year ago

There's no reason it needs to take several months to add the new packages. Though suppose it could take some time. It's more dependent on reviews than anything else.

My estimate is not meant to downplay how hard people are working on this, but getting 40+ complicated, low-level packages that further interact with each other packaged up is a large undertaking, and I'm doubtful it can be done more quickly than "months", but would be happily surprised if so. In general, very happy to see this effort happening at all! :)

h-vetinari commented 1 year ago

Ping @jakirkham @isuruf. At the December core meeting we said that this is close, but I don't know how to fix those permission errors. Any ideas/instructions you could share?

conda-forge-webservices[bot] commented 1 year ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

h-vetinari commented 1 year ago

Seems that https://github.com/conda-forge/docker-images/pull/229 was not enough unfortunately:

+++ mv ./usr/local/cuda-11.2/targets/sbsa-linux /usr/local/cuda/targets/sbsa-linux
mv: cannot move ‘./usr/local/cuda-11.2/targets/sbsa-linux’ to ‘/usr/local/cuda/targets/sbsa-linux’: Permission denied

(I checked that the images being pulled in are the newest ones from here)

h-vetinari commented 1 year ago

OK cool, thanks, this progressed to the next stage now. New boss fight:

The reported errors are:
- Encountered problems while solving:
-   - nothing provides requested nvcc_linux-aarch64 11.2.*

isuruf commented 1 year ago

Also, my PR has 3 commits and there are only 2 here. Please keep my commit history.

h-vetinari commented 1 year ago

Context from an old discussion further up that's been reactivated just now:

@h-vetinari: Not sure if adding this migration is strictly speaking necessary, but since the linux builds have both cuda / non-cuda, I think we should do the same for aarch/ppc?

This reflects the changes in conda-forge/conda-forge-pinning-feedstock#3624, which we should ideally merge beforehand (then we can skip the use_local: true here).

@isuruf: Please remove this until nvcc-feedstock is ready

@h-vetinari: No problem, I'll remove it; though I wonder if we'll still be able to cross-compile cuda without this (which is the point of this PR, no?)

h-vetinari commented 1 year ago

Also, my PR has 3 commits and there are only 2 here. Please keep my commit history.

I squashed together what was a pretty obvious fix-up with its semantic parent.

Not least since I've had to rebase this a bunch of different times (and it's a cleaner history), I find this extremely nit-picky. You're free to pick my commits into your PR (squashed, non-squashed, as you wish).

isuruf commented 1 year ago

Not least since I've had to rebase this a bunch of different times (and it's a cleaner history), I find this extremely nit-picky.

It's not you. I don't like my commits being squashed at all. It's been a struggle trying to get people not to do this, but I think I've convinced everybody in conda-forge/core, at least, not to squash my commits.

h-vetinari commented 1 year ago

It's been a struggle trying to get people not to do this, but I think I've convinced everybody in conda-forge/core, at least, not to squash my commits.

OK, I was not aware of that. Happy to oblige in the future. But since I had to rebase them to keep up with main, the committer will still be me - is that what you want to avoid as well, or just the squashing of commits together in terms of content?

isuruf commented 1 year ago

The committer being you is fine. I just don't want my commit squashed.

h-vetinari commented 1 year ago

The committer being you is fine. I just don't want my commit squashed.

OK, I'll keep this in mind. For context, I often need to trawl through feedstock history, and I find sequences of commits like:

[screenshot of a series of fix-up commits]

really annoying (aside from the content-free commit messages that the GH UI encourages), because they tackle one specific issue and semantically should be grouped together. This is not a dig at your commits here; I do the same during development as well, I just have the habit of cleaning up my commit history as I go, or before merging, so that it's easier to navigate down the line.

But now that I know that you don't like that, I'll make an exception for your commits.

isuruf commented 1 year ago

If you want, you can clean up the commit message when there's no message, but keep my commits as they are in the future.

h-vetinari commented 1 year ago

@isuruf: How do you propose we solve the issue in https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/210#issuecomment-1412993900?

@h-vetinari: I don't know yet, I'm trying to help work through the issues as they appear. But the overarching goal remains to cross-compile CUDA, so if we're falling short of that, we should keep this open IMO.

So what are the next steps we need to tackle for cross-compiling CUDA, and where do we track them (if not here)?

isuruf commented 1 year ago

We can track them in nvcc-feedstock