conda-forge / cupy-feedstock

A conda-smithy repository for cupy.
BSD 3-Clause "New" or "Revised" License

Avoid setting CFLAGS, CPPFLAGS, and CXXFLAGS #59

Closed leofang closed 4 years ago

leofang commented 4 years ago

Close #9, close #58.


conda-forge-linter commented 4 years ago

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

leofang commented 4 years ago

@conda-forge-admin, please rerender

leofang commented 4 years ago

@isuruf @h-vetinari @jakirkham Either conda-forge/docker-images#135 or conda-forge/docker-images#136 caused all CUDA builds to fail here, see https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=163080&view=results.

In this feedstock, we rely on the fact that libcuda.so is not found in the docker image, so import cupy fails and our test script exits gracefully. Now it seems the driver becomes available after the changes in the docker images, so import cupy would succeed, but then calling cupy functions (which call the driver API on a system without a GPU) errors out.
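
To illustrate, the test boils down to something like the following (a minimal sketch of the idea, not the literal test script in this recipe):

# Sketch of the old test logic: on a GPU-less CI the import is expected to fail
# because libcuda.so is missing, and that failure is treated as a graceful exit.
if python -c "import cupy"; then
    echo "driver found; real GPU checks could run here"
else
    echo "libcuda.so not found; skipping GPU tests"
    exit 0
fi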

leofang commented 4 years ago

I found there are a bunch of recent efforts on CF that attempt to add libcuda.so... I don't understand why that is needed 🤦🏻‍♂️ On the CIs (Azure, etc.), no GPU is available, so this effort at most makes something simple like import some_lib work, but nothing further. However, if testing locally or running on self-hosted CIs that have a GPU (we've done this recently), it'd overwrite the host driver and create chaos...

leofang commented 4 years ago

OK, tests on CUDA 9.2 passed, because there is no libcuda.so available (see https://github.com/conda-forge/conda-forge-ci-setup-feedstock/pull/93).

leofang commented 4 years ago

@jakirkham as a workaround I kinda undid #53 in 0ad5edb.

isuruf commented 4 years ago

or running on self-hosted CIs that have a GPU (we've done this recently), it'd overwrite the host driver and create chaos...

It wouldn't overwrite it. Do you have a log where it would overwrite?

h-vetinari commented 4 years ago

@leofang Sorry that this broke your build. The presence of libcuda.so is IMO justified in a CUDA image (in fact it was always there in the 10.x images, just not recognised by ldconfig), and some recipes like https://github.com/conda-forge/pyarrow-feedstock/pull/101 depend on this to work.

In this feedstock, we rely on the fact that libcuda.so is not found in the docker image, so import cupy fails and our test script exits gracefully.

It seems like relying on a failed import does not yield much information regarding the soundness of the package - is there no other way to do this? In fact, I have the same problem (of a GPU device being necessary) in https://github.com/conda-forge/faiss-split-feedstock/pull/1 to actually test the code, but that can only be solved with CI instances that have GPUs.

or running on self-hosted CIs that have a GPU (we've done this recently), it'd overwrite the host driver and create chaos...

I agree with @isuruf that this wouldn't happen. The default installation for CUDA drivers is in /usr/local/nvidia/lib:/usr/local/nvidia/lib64 (corresponding to the value of LD_LIBRARY_PATH), and any driver installed there should have precedence over the one in /usr/local/cuda/compat.
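
A quick way to check this on a given machine (just an illustrative diagnostic, nothing from the recipe itself):

# List every libcuda.so the runtime linker knows about, in cache order:
ldconfig -p | grep libcuda.so
# Directories in LD_LIBRARY_PATH are searched before the ldconfig cache:
echo $LD_LIBRARY_PATH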

leofang commented 4 years ago

@isuruf @h-vetinari Thanks for replying!

Sorry that this broke your build. The presence of libcuda.so is IMO justified in a CUDA image (in fact it was always there in the 10.x images, just not recognised by ldconfig), and some recipes like conda-forge/pyarrow-feedstock#101 depend on this to work.

Right, but as I said, this effort at most makes something simple like import some_lib (as in your case) work, but nothing further. I know it is a standard test in conda recipes, but as far as I remember none of the bugs we hit in this feedstock could be caught by this test. Feel free to ignore my ranting, though 🙂

It seems like relying on a failed import does not yield much information regarding the soundness of the package - is there no other way to do this? In fact, I have the same problem (of a GPU device being necessary) in conda-forge/faiss-split-feedstock#1 to actually test the code, but that can only be solved with CI-instance that have GPUs.

I second your request, although I have mixed views. I already left a comment in the issue https://github.com/conda-forge/conda-forge.github.io/issues/1062 you opened.

I agree with @isuruf that this wouldn't happen. The default installation for CUDA drivers is in /usr/local/nvidia/lib:/usr/local/nvidia/lib64 (corresponding to the value of LD_LIBRARY_PATH), and any driver installed there should have precedence over the one in /usr/local/cuda/compat.

On the machines I have access to, they are always in /usr/lib/x86_64-linux-gnu/. The one installed in /usr/local/cuda/lib64 is just a stub for cross-compiling.

isuruf commented 4 years ago

@leofang, in the docker image, where are the libcuda.so.1s located? Can you also run ldconfig -v 2>/dev/null | grep -v ^$'\t'?

leofang commented 4 years ago

@isuruf Not sure what this is for, but here you go.

condaforge/linux-anvil-cuda:9.2:

$ find / -name libcuda.so.1
/usr/lib64/libcuda.so.1
$
$ ldconfig -v 2>/dev/null | grep -v ^$'\t'
/usr/local/cuda-9.2/targets/x86_64-linux/lib:
/lib:
/lib64:
/usr/lib:
/usr/lib64:
/lib64/tls: (hwcap: 0x8000000000000000)
/usr/lib64/sse2: (hwcap: 0x0000000004000000)
/usr/lib64/tls: (hwcap: 0x8000000000000000)

condaforge/linux-anvil-cuda:10.0:

$ find / -name libcuda.so.1
/usr/lib64/libcuda.so.1
/usr/local/cuda-10.0/compat/libcuda.so.1
$
$ ldconfig -v 2>/dev/null | grep -v ^$'\t'
/usr/local/cuda-10.0/targets/x86_64-linux/lib:
/lib:
/lib64:
/usr/lib:
/usr/lib64:
/lib64/tls: (hwcap: 0x8000000000000000)
/usr/lib64/sse2: (hwcap: 0x0000000004000000)
/usr/lib64/tls: (hwcap: 0x8000000000000000)

condaforge/linux-anvil-cuda:10.1:

$ find / -name libcuda.so.1
/usr/lib64/libcuda.so.1
/usr/local/cuda-10.1/compat/libcuda.so.1
$
$ ldconfig -v 2>/dev/null | grep -v ^$'\t'
/usr/local/cuda-10.1/targets/x86_64-linux/lib:
/lib:
/lib64:
/usr/lib:
/usr/lib64:
/lib64/tls: (hwcap: 0x8000000000000000)
/usr/lib64/sse2: (hwcap: 0x0000000004000000)
/usr/lib64/tls: (hwcap: 0x8000000000000000)

condaforge/linux-anvil-cuda:10.2:

$ find / -name libcuda.so.1
/usr/lib64/libcuda.so.1
/usr/local/cuda-10.2/compat/libcuda.so.1
$
$ ldconfig -v 2>/dev/null | grep -v ^$'\t'
/usr/local/cuda-10.2/targets/x86_64-linux/lib:
/lib:
/lib64:
/usr/lib:
/usr/lib64:
/lib64/tls: (hwcap: 0x8000000000000000)
/usr/lib64/sse2: (hwcap: 0x0000000004000000)
/usr/lib64/tls: (hwcap: 0x8000000000000000)

one of our local machines:

$ locate libcuda.so.1
/usr/lib/i386-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.1
$
$ ldconfig -v 2>/dev/null | grep -v ^$'\t'
/usr/lib/x86_64-linux-gnu/libfakeroot:
/lib/i386-linux-gnu:
/usr/lib/i386-linux-gnu:
/usr/local/lib:
/lib/x86_64-linux-gnu:
/usr/lib/x86_64-linux-gnu:
/lib:
/usr/lib:

leofang commented 4 years ago

(By the way, how to get root access in the docker images? docker run -u root doesn't work...)

isuruf commented 4 years ago

@h-vetinari, how about we add the existing paths in front of the cuda compat directory. Something like,

ldconfig -v 2>/dev/null | grep -v ^$'\t' | cut -f1 -d":" >> /etc/ld.so.conf.d/cuda-$CUDA_VER.conf ;
echo "$CUDA_HOME/compat" >> /etc/ld.so.conf.d/cuda-$CUDA_VER.conf ;

?
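
For instance, on the 10.2 image (using the ldconfig output posted above, and assuming CUDA_HOME resolves to /usr/local/cuda-10.2) this would append:

# entries appended to /etc/ld.so.conf.d/cuda-10.2.conf, in this order:
/usr/local/cuda-10.2/targets/x86_64-linux/lib
/lib
/lib64
/usr/lib
/usr/lib64
/lib64/tls
/usr/lib64/sse2
/usr/lib64/tls
/usr/local/cuda-10.2/compat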

h-vetinari commented 4 years ago

@isuruf: @h-vetinari, how about we add the existing paths in front of the cuda compat directory.

Sure. I don't know the precedence of ldconfig vs. ld.so.conf.d, but if we're not sure that the latter comes after, then we can add the rest as well.

@leofang: (By the way, how to get root access in the docker images? docker run -u root doesn't work...)

I don't think you can. The entrypoint script switches to the conda user, which has no elevated privileges except for using yum.

isuruf commented 4 years ago

I don't know the precedence of ldconfig vs. ld.so.conf.d, but if we're not sure that the latter comes after, then we can add the rest as well.

Entries in ld.so.conf.d come before the preconfigured directories, and we need the cuda compat directory to come after the preconfigured ones.
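
For reference, on these CentOS-based images /etc/ld.so.conf typically contains just an include of the drop-in directory, and ldconfig appends the built-in trusted directories (/lib, /lib64, /usr/lib, /usr/lib64) after everything from the config files, which is why the drop-in entries sort first in the ldconfig -v output above:

$ cat /etc/ld.so.conf
include ld.so.conf.d/*.conf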

h-vetinari commented 4 years ago

Entries in ld.so.conf.d come before the preconfigured directories, and we need the cuda compat directory to come after the preconfigured ones.

Soooooo, one more PR for the dockerfiles, I'm guessing.

leofang commented 4 years ago

Noticed something weird with our CI, let me check here...

leofang commented 4 years ago

NVM. I made a mistake in my local tests. This is good to go.

leofang commented 4 years ago

Thanks for the suggestions, @isuruf! The PR is ready to go. Let's wait a little to see if @jakirkham has any feedback.

jakirkham commented 4 years ago

Sorry, I was out last week, so I'm not caught up on all of the changes that went in. Could someone please summarize? 🙂

leofang commented 4 years ago

Hi @jakirkham, in https://github.com/conda-forge/docker-images/pull/134 the CUDA headers living outside of the CUDA root were symlinked back, so your CFLAGS hacks in the recipe are no longer needed.

Moreover, libcuda.so was made available in https://github.com/conda-forge/docker-images/pull/135 and https://github.com/conda-forge/docker-images/pull/136, so import cupy would not fail, but any actual usage like cupy.show_config() would, so the test was modified.

Another change to the docker images was to deal with the linker discovery order, but I don't think it is critical here.
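
Concretely, the adjusted test amounts to something like this (a sketch of the idea, not the literal test script in the recipe):

# Sketch of the adjusted test logic: the import should now succeed even without
# a GPU, while anything that actually calls the driver API should not.
python -c "import cupy" || exit 1                    # importing alone must work
if python -c "import cupy; cupy.show_config()"; then
    echo "GPU present; full checks could run here"
else
    echo "no usable GPU/driver on this CI; exiting gracefully"
    exit 0
fi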