Open hmaarrfk opened 6 months ago
Thanks for raising, Mark! 🙏

`nvvm` is actually the expected location for these files. In CUDA 12, the `nvvm` contents match the CUDA Toolkit layout. In CUDA 11, the `cudatoolkit` package does not match this layout ( https://github.com/conda-forge/cudatoolkit-feedstock/issues/96 ). Hence `libdevice` and other bits wind up in the wrong place in the `cudatoolkit` package. I think we discussed this before in issue ( https://github.com/conda-forge/tensorflow-feedstock/issues/296 ), where `cudatoolkit` package layout issues had cropped up.
As for `cicc` itself, it is typically used by `nvcc` (not usually by external programs). I have seen one other case where `cicc` was not found, but after further investigation it turned out to be due to some build configuration issues ( https://github.com/scopetools/cudadecon/pull/29 ). So I am wondering if there is a similar issue here. Do you have more context on the issue that came up?
It came up with the TensorFlow 2.15 and CUDA builds.
OK, is there a log or something we could look at?
Not really, since we disable building TF on the CIs. You can see how I modified the build script, though. But I'll upload something tomorrow.
https://github.com/conda-forge/tensorflow-feedstock/pull/366
Edit: this comment in particular shows a small portion of the log: https://github.com/conda-forge/tensorflow-feedstock/pull/366#issuecomment-1872621256
Completely understandable
An uploaded log would work. Happy to look at snippets too
We might consider setting up TensorFlow on the Quansight CI as well to make that a bit easier to manage
Found this (admittedly old) thread, which mentions `cicc` may need to be in the search path. This logic should add NVVM's `bin` directory to the `$PATH`.
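As a minimal sketch of that PATH logic (a stub stands in for the real `cicc`, and the directory layout is assumed from the CUDA 12 `nvvm/bin` location discussed above), prepending NVVM's `bin` directory makes `cicc` resolvable:

```shell
# Simulate a prefix whose nvvm/bin holds cicc (a stub here, not the real compiler).
prefix="$(mktemp -d)"
mkdir -p "${prefix}/nvvm/bin"
printf '#!/bin/sh\necho cicc-stub\n' > "${prefix}/nvvm/bin/cicc"
chmod +x "${prefix}/nvvm/bin/cicc"

# The workaround: put nvvm/bin on the search path so nvcc's shell-out finds cicc.
export PATH="${prefix}/nvvm/bin:${PATH}"
command -v cicc
```

In a real environment, the equivalent would be `export PATH="${CONDA_PREFIX}/nvvm/bin:${PATH}"`.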
Is there any action we need in this feedstock?
I'm not sure. Happy to revisit in the future. I haven't had time to go through the TensorFlow builds in a long time.
I'm seeing the same issue as in https://github.com/scopetools/cudadecon/pull/29, in a similar situation with old CUDA code using CMake. And I can reproduce it without calling `cicc` directly:
```shell
conda create -n test
conda activate test
conda install cuda-toolkit
touch source.cu
${CONDA_PREFIX}/bin/nvcc -c source.cu  # works
${CONDA_PREFIX}/targets/x86_64-linux/bin/nvcc -c source.cu
# <command-line>: fatal error: cuda_runtime.h: No such file or directory
${CONDA_PREFIX}/targets/x86_64-linux/bin/nvcc -c -I${CONDA_PREFIX}/targets/x86_64-linux/include source.cu
# sh: 1: cicc: not found
```
For the failing call, `strace` tells me:

```
openat(AT_FDCWD, "${CONDA_PREFIX}/targets/x86_64-linux/bin/nvcc.profile", O_RDONLY) = -1 ENOENT (No such file or directory)
```

while for the successful one it says:

```
openat(AT_FDCWD, "${CONDA_PREFIX}/bin/nvcc.profile", O_RDONLY) = 3
```
This latter file contains the line `CICC_PATH = $(TOP)/nvvm/bin`, which explains why `cicc` isn't found, I think.
So if `nvcc` tries to find its configuration in a location relative to itself, perhaps the symlink for `nvcc` should be accompanied by one for `nvcc.profile`?
And a little more digging: CMake runs `nvcc -v __cmake_determine_cuda`, which prints the configuration as created from `nvcc.profile` and then errors out. This output has the line `#$ TOP=${CONDA_PREFIX}/bin/../targets/x86_64-linux`, which CMake then uses to locate `nvcc` at `${CONDA_PREFIX}/targets/x86_64-linux/bin/nvcc`, from where it can't find its configuration.
So it's the `nvcc.profile` itself that points CMake to a version of `nvcc` that cannot read `nvcc.profile` :smile:.

Adding a symlink at `${CONDA_PREFIX}/targets/x86_64-linux/bin/nvcc.profile` pointing to `${CONDA_PREFIX}/bin/nvcc.profile` fixes the problem.
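That fix can be sketched as a simulation (a temp directory stands in for `${CONDA_PREFIX}`, and the profile contents are the single line quoted earlier; the real file has more entries): once the symlink sits beside the `targets/.../bin/nvcc` symlink, the profile is readable from either location:

```shell
# Scratch layout mimicking the conda prefix; not a real CUDA install.
prefix="$(mktemp -d)"
mkdir -p "${prefix}/bin" "${prefix}/targets/x86_64-linux/bin"

# Stand-in for the nvcc.profile that ships next to ${CONDA_PREFIX}/bin/nvcc.
echo 'CICC_PATH = $(TOP)/nvvm/bin' > "${prefix}/bin/nvcc.profile"

# The fix: symlink the profile beside the nvcc symlink under targets/.
ln -s "${prefix}/bin/nvcc.profile" \
      "${prefix}/targets/x86_64-linux/bin/nvcc.profile"

# Both paths now resolve to the same profile.
cat "${prefix}/targets/x86_64-linux/bin/nvcc.profile"
```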
@robertmaynard @adibbley do you have insights for what Lorens brought up above?
I've now also added a symlink for the `bin/crt` directory, to avoid errors when linking code that uses the driver API against the stubs. I still have more issues, but the code I'm working on is also messy, so they may be unrelated.
What CMake version are you using? This sounds like an older version of CMake that didn't properly handle symlinks inside `TOP`, which has since been fixed.
This is a new CMake, but with an old configuration that uses the now-obsolete `FindCUDA` macro.

But my first example reproduces the problem without CMake being involved in any way. Are you saying that users are expected to first resolve the symlink at `${CONDA_PREFIX}/targets/x86_64-linux/bin/nvcc`, rather than trying to run it directly as if it were the linked-to executable?
After looking at this more, the issue is entirely due to a bad setup by conda. You are correct that an `nvcc.profile` needs to be beside the `nvcc` symlink in `${CONDA_PREFIX}/targets/x86_64-linux/bin/`.

In its current form, the `nvcc` at `${CONDA_PREFIX}/targets/x86_64-linux/bin/` is broken, and the verbose output from the compiler looks like:
```
#$ NVCC_PREPEND_FLAGS=" -ccbin=/home/rmaynard/miniconda3/envs/cuda_stub_env/bin/x86_64-conda-linux-gnu-c++"
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_=
#$ _CUDART_=cudart
#$ _HERE_=/home/rmaynard/miniconda3/envs/cuda_stub_env/targets/x86_64-linux/bin
#$ _THERE_=/home/rmaynard/miniconda3/envs/cuda_stub_env/targets/x86_64-linux/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_SIZE_=64
#$ "/home/rmaynard/miniconda3/envs/cuda_stub_env/bin"/x86_64-conda-linux-gnu-c++ ....
```
When I symlink the `nvcc.profile` as well into `targets/x86_64-linux/bin`, I see proper paths for the `crt` headers being included, and a simple test case properly finds them.
@leofang @adibbley We need to create an `nvcc.profile` symlink like we do for `targets/x86_64-linux/bin/nvcc`.
@LourensVeen The only reason that `${CONDA_PREFIX}/targets/x86_64-linux/bin/nvcc` exists is to support legacy CMake versions, where `FindCUDA` or `FindCUDAToolkit` would validate the CUDA Toolkit layout by searching for an `nvcc` executable under `bin`. Therefore we have that symlink, so that `targets/x86_64-linux/` matches the checked layout.

But I also believe that if we are going to offer a symlink to the compiler, it should work, so that we don't hand users footguns.
Edit: So at some point, expect `targets/x86_64-linux/bin/nvcc` to go away, leaving the only `nvcc` compiler in `<prefix>/bin`.
Okay, that makes sense to me. I'll be updating that CMake config.
You need to symlink `bin/crt` as well to make `nvcc` work, if you want a temporary solution.
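Putting both pieces together, the temporary workaround sketched in this thread amounts to mirroring `nvcc.profile` and `bin/crt` under `targets/x86_64-linux/bin/`. Shown here against a scratch directory so it is safe to run anywhere; in a real environment, substitute your actual `${CONDA_PREFIX}`:

```shell
# Scratch stand-in for ${CONDA_PREFIX}; a real env already has bin/nvcc.profile and bin/crt.
prefix="$(mktemp -d)"
tbin="${prefix}/targets/x86_64-linux/bin"
mkdir -p "${prefix}/bin/crt" "${tbin}"
touch "${prefix}/bin/nvcc.profile"

# The temporary workaround: mirror nvcc.profile and bin/crt under targets/.
ln -sfn "${prefix}/bin/nvcc.profile" "${tbin}/nvcc.profile"
ln -sfn "${prefix}/bin/crt"          "${tbin}/crt"
```

`-sfn` makes the links idempotent to create, so rerunning after a package update is harmless.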
Solution to issue cannot be found in the documentation.

Issue

`cicc` seems to be in `${PREFIX}/nvvm/bin` instead of `${PREFIX}/bin`, and so does `libdevice.10.bc`.

xref: https://github.com/conda-forge/tensorflow-feedstock/issues/296