facebookresearch / TensorComprehensions

A domain-specific language to express machine learning workloads.
https://facebookresearch.github.io/TensorComprehensions/
Apache License 2.0
1.76k stars 211 forks

[Build] issues finding Cuda, incomplete config #407

Open keightyfive opened 6 years ago

keightyfive commented 6 years ago

Hi,

I want to build TC from source on a cluster (https://www.macs.hw.ac.uk/~hv15/robotarium/about) and run some of the benchmarks, here's the information regarding my working environment:

I have followed all the steps to build from source (non-conda env) and have finally executed the cmd BUILD_TYPE=Release PYTHON=$(which python3) WITH_CAFFE2=OFF CLANG_PREFIX=$HOME/.apps/tc ./build.sh --all whereas ~/.apps/tc is of course a directory I have created locally since I'm working on a cluster and don't want to install stuff on the head node.

I get the following error:

[  5%] Building CXX object src/ATen/cpu/tbb/CMakeFiles/tbb_static.dir/tbb_remote/src/tbb/tbb_misc.cpp.o
[  6%] Linking CXX static library libtbb_static.a
[ 14%] Built target tbb_static
make[2]: *** No rule to make target '/usr/lib64/liblapack.so', needed by 'src/ATen/libATen.so'.  Stop.
make[1]: *** [src/ATen/CMakeFiles/ATen.dir/all] Error 2
make: *** [all] Error 2

LAPACK is installed however. The problem seems to be that on the cluster, LAPACK is exposed via the environment variables only (LD_LIBRARY_PATH etc.) and the CMake build script seems to expect LAPACK to be installed at /usr/lib/... which it is not...
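To confirm that diagnosis, one can check where the module system actually exposes the library. This is only an illustrative sketch (the helper name and usage are not part of TC's build system): it walks the colon-separated `LD_LIBRARY_PATH` looking for `liblapack.so`.

```shell
# Illustrative helper: search LD_LIBRARY_PATH (colon-separated) for a
# library file, since the cluster exposes LAPACK only via environment
# variables rather than /usr/lib64.
find_in_ld_library_path() {
  lib="$1"
  old_ifs="$IFS"
  IFS=':'
  for dir in $LD_LIBRARY_PATH; do
    if [ -f "$dir/$lib" ]; then
      IFS="$old_ifs"
      echo "$dir/$lib"
      return 0
    fi
  done
  IFS="$old_ifs"
  return 1
}

find_in_ld_library_path liblapack.so || echo "liblapack.so not on LD_LIBRARY_PATH"
```

If the file turns up, its directory could in principle be handed to CMake explicitly (for example via a `-DLAPACK_LIBRARIES=...` cache entry) instead of relying on the hard-coded `/usr/lib64` location; whether the ATen submodule's CMake honors that variable is an assumption to verify.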

If you guys could help me out with a patch or another quick fix I'd be grateful.

Cheers Kevin

skimo-openhub commented 6 years ago

On Mon, May 07, 2018 at 07:06:24PM +0000, Kevin Klein wrote:

I get the following error:

[  5%] Building CXX object src/ATen/cpu/tbb/CMakeFiles/tbb_static.dir/tbb_remote/src/tbb/tbb_misc.cpp.o
[  6%] Linking CXX static library libtbb_static.a
[ 14%] Built target tbb_static
make[2]: *** No rule to make target '/usr/lib64/liblapack.so', needed by 'src/ATen/libATen.so'.  Stop.
make[1]: *** [src/ATen/CMakeFiles/ATen.dir/all] Error 2
make: *** [all] Error 2

This looks like an issue in the pytorch submodule. See if you can reproduce by compiling pytorch on its own and then report it there.

skimo

keightyfive commented 6 years ago

I am now installing everything according to the new installation guide (https://facebookresearch.github.io/TensorComprehensions/installation.html). All worked fine so far, now all I have to do is the final step:

CLANG_PREFIX=$(${CONDA_PREFIX}/bin/llvm-config --prefix) ./build.sh

but in which directory do I execute the above cmd? Do I just clone the GitHub repo? I'm getting a few CMake errors:

Found PROTOBUF_LIBRARIES: PROTOBUF_LIBRARIES-NOTFOUND
Found PROTOBUF_PROTOC_EXECUTABLE: PROTOBUF_PROTOC_EXECUTABLE-NOTFOUND
Found PROTOBUF_INCLUDES: PROTOBUF_INCLUDES-NOTFOUND
CMake Error at CMakeLists.txt:102 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/gflags" which
  is not an existing directory.

CMake Error at CMakeLists.txt:114 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/glog" which
  is not an existing directory.

CMake Error at CMakeLists.txt:124 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/googletest" which
  is not an existing directory.

CMake Error at /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/share/cmake-3.11/Modules/FindCUDA.cmake:687 (message):
  Specify CUDA_TOOLKIT_ROOT_DIR
Call Stack (most recent call first):
  CMakeLists.txt:139 (find_package)

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
PROTOBUF_INCLUDES
   used as include directory in directory /home/kklein/.apps/TensorComprehensions
   used as include directory in directory /home/kklein/.apps/TensorComprehensions

-- Configuring incomplete, errors occurred!
See also "/home/kklein/.apps/TensorComprehensions/build/CMakeFiles/CMakeOutput.log".

Cheers, Kevin

skimo-openhub commented 6 years ago

On Sun, Jun 24, 2018 at 05:25:48PM -0700, Kevin Klein wrote:

I am now installing everything according to the new installation guide (https://facebookresearch.github.io/TensorComprehensions/installation.html). Now all I have to do is the final step:

CLANG_PREFIX=$(${CONDA_PREFIX}/bin/llvm-config --prefix) ./build.sh

Hmmm... it seems it doesn't explicitly mention that you need to clone TC:

git clone http://www.github.com/facebookresearch/TensorComprehensions --recursive

in which directory do I execute the above cmd?

In the top-level of the cloned git repo.

It complains it can't find the build script, and I don't know where it is located either... also I'm a bit confused about what CLANG_PREFIX and CONDA_PREFIX are supposed to be in this context.

I think CONDA_PREFIX is set by conda activate. clang is installed there through

conda install -y -c nicolasvasilache llvm-tapir50

skimo

keightyfive commented 6 years ago

Thanks,

I did realise shortly after I posted this that you have to clone from Git, although I didn't use --recursive.

I then executed the cmd when in conda activate, but I got the following errors:

Found PROTOBUF_LIBRARIES: PROTOBUF_LIBRARIES-NOTFOUND
Found PROTOBUF_PROTOC_EXECUTABLE: PROTOBUF_PROTOC_EXECUTABLE-NOTFOUND
Found PROTOBUF_INCLUDES: PROTOBUF_INCLUDES-NOTFOUND
CMake Error at CMakeLists.txt:102 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/gflags" which
  is not an existing directory.

CMake Error at CMakeLists.txt:114 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/glog" which
  is not an existing directory.

CMake Error at CMakeLists.txt:124 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/googletest" which
  is not an existing directory.

CMake Error at /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/share/cmake-3.11/Modules/FindCUDA.cmake:687 (message):
  Specify CUDA_TOOLKIT_ROOT_DIR
Call Stack (most recent call first):
  CMakeLists.txt:139 (find_package)

CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
PROTOBUF_INCLUDES
   used as include directory in directory /home/kklein/.apps/TensorComprehensions
   used as include directory in directory /home/kklein/.apps/TensorComprehensions

-- Configuring incomplete, errors occurred!
See also "/home/kklein/.apps/TensorComprehensions/build/CMakeFiles/CMakeOutput.log".

exit 1
skimo-openhub commented 6 years ago

On Mon, Jun 25, 2018 at 09:50:54AM -0700, Kevin Klein wrote:

Thanks,

I did realise shortly after I posted this that you have to clone from Git, although I didn't use --recursive.

You should.

I then executed the cmd when in conda activate, but I got the following errors:

Found PROTOBUF_LIBRARIES: PROTOBUF_LIBRARIES-NOTFOUND
Found PROTOBUF_PROTOC_EXECUTABLE: PROTOBUF_PROTOC_EXECUTABLE-NOTFOUND
Found PROTOBUF_INCLUDES: PROTOBUF_INCLUDES-NOTFOUND
CMake Error at CMakeLists.txt:102 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/gflags" which
  is not an existing directory.

CMake Error at CMakeLists.txt:114 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/glog" which
  is not an existing directory.

CMake Error at CMakeLists.txt:124 (add_subdirectory):
  add_subdirectory given source "third-party/googlelibraries/googletest" which
  is not an existing directory.

This is because you didn't use "--recursive".
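A quick way to see the symptom is that those third-party paths exist in the tree but their checkouts are empty. This is only an illustrative sketch (the helper is not part of TC; the directory name is taken from the CMake error message), run from the top of the clone:

```shell
# Illustrative check: a submodule cloned without --recursive is an
# empty directory, which is exactly what add_subdirectory rejects.
is_populated() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

if ! is_populated third-party/googlelibraries/gflags; then
  echo "submodules missing; run: git submodule update --init --recursive"
fi
```

Running `git submodule update --init --recursive` in an existing clone should have the same effect as re-cloning with `--recursive`.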

CMake Error at /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/share/cmake-3.11/Modules/FindCUDA.cmake:687 (message):
  Specify CUDA_TOOLKIT_ROOT_DIR

Did you set CUDA_TOOLKIT_ROOT_DIR?

skimo

keightyfive commented 6 years ago

Thanks... I didn't know checking out recursively was necessary... I set CUDA_TOOLKIT_ROOT_DIR to the path where Cuda90 is installed in my .bashrc file... I assume that was not what you're supposed to do?

skimo-openhub commented 6 years ago

On Mon, Jun 25, 2018 at 11:48:05AM -0700, Kevin Klein wrote:

Thanks... I didn't know checking out recursively was necessary... I set CUDA_TOOLKIT_ROOT_DIR to the path where Cuda90 is installed in my .bashrc file... I assume that was not what you're supposed to do?

It doesn't matter that much where you set CUDA_TOOLKIT_ROOT_DIR, just as long as you make sure it is set when ./build.sh gets called.
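One pitfall worth checking here: a plain assignment in `.bashrc` without `export` is invisible to child processes such as `./build.sh`. This is an illustrative sketch (the helper name is made up); `env` lists only exported variables, so it distinguishes the two cases:

```shell
# Illustrative check: env prints only *exported* variables, so this
# tells you whether build.sh (a child process) would see the setting.
is_exported() {
  env | grep -q "^$1="
}

CUDA_TOOLKIT_ROOT_DIR=/cm/shared/modulefiles/cuda90/toolkit/9.0.176
is_exported CUDA_TOOLKIT_ROOT_DIR || echo "not visible to build.sh yet"
export CUDA_TOOLKIT_ROOT_DIR
is_exported CUDA_TOOLKIT_ROOT_DIR && echo "now visible to child processes"
```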

skimo

keightyfive commented 6 years ago

I checked out the repo with --recursive this time, and it seems to build fine until the point where it still doesn't find the Cuda stuff... I am on a SLURM based cluster where you have to load CUDA as a module. I have specified where it is in my .bashrc file and I'm pretty sure this is the correct path (/cm/shared/modulefiles/cuda90/toolkit/9.0.176). I printed with echo $CUDA_TOOLKIT_ROOT_DIR and it is clearly the same path, but it still gives me this error:

make[2]: *** [google/protobuf/descriptor.lo] Error 1
make[2]: Waiting for unfinished jobs....
google/protobuf/compiler/cpp/cpp_message_field.cc:54:8: warning: ‘std::__cxx11::string google::protobuf::compiler::cpp::{anonymous}::StaticCast(const string&, const string&, bool)’ defined but not used [-Wunused-function]
 string StaticCast(const string& type, const string& expression,
        ^~~~~~
make[2]: Leaving directory '/home/kklein/.apps/tc_0_1_1/TensorComprehensions/third-party/googlelibraries/protobuf-3.5.2/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/home/kklein/.apps/tc_0_1_1/TensorComprehensions/third-party/googlelibraries/protobuf-3.5.2'
make: *** [all] Error 2
Found PROTOBUF_LIBRARIES: PROTOBUF_LIBRARIES-NOTFOUND
Found PROTOBUF_PROTOC_EXECUTABLE: PROTOBUF_PROTOC_EXECUTABLE-NOTFOUND
Found PROTOBUF_INCLUDES: /home/kklein/.apps/tc_0_1_1/TensorComprehensions/third-party/googlelibraries/protobuf-3.5.2/src
CMake Error at /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/share/cmake-3.11/Modules/FindCUDA.cmake:687 (message):
  Specify CUDA_TOOLKIT_ROOT_DIR
Call Stack (most recent call first):
  CMakeLists.txt:139 (find_package)

-- Configuring incomplete, errors occurred!
See also "/home/kklein/.apps/tc_0_1_1/TensorComprehensions/build/CMakeFiles/CMakeOutput.log".
See also "/home/kklein/.apps/tc_0_1_1/TensorComprehensions/build/CMakeFiles/CMakeError.log".

I've tried all sorts of paths with and without loading the module, and it still doesn't know where it is specified...

keightyfive commented 6 years ago

Update: I tried again with "export" in my .bashrc (export CUDA_TOOLKIT_ROOT_DIR="/cm/shared/modulefiles/cuda90/toolkit/9.0.176") and I think it worked... I'm not sure if the build was entirely clean, as it still gives me this error message at the very end:

+++ dirname ./build.sh
++ cd .
++ pwd

-- Configuring incomplete, errors occurred!
See also "/home/kklein/.apps/tc_0_1_1/TensorComprehensions/build/CMakeFiles/CMakeOutput.log".
See also "/home/kklein/.apps/tc_0_1_1/TensorComprehensions/build/CMakeFiles/CMakeError.log".

When I execute test.sh on one of the nodes it says:

find: ‘build/tc/benchmarks’: No such file or directory
SUCCESS

... So what now? Is this a success or not? :D

skimo-openhub commented 6 years ago

On Tue, Jun 26, 2018 at 01:36:29PM -0700, Kevin Klein wrote:

-- Found CUDA: /cm/shared/modulefiles/cuda90/toolkit/9.0.176 (found version "9.0")
-- Automatic GPU detection failed. Building for common architectures.

This looks suspicious. Do you have a CUDA card? Try and find out why the detection failed.

-- Autodetected CUDA architecture(s): 3.0;3.5;5.0;5.2;6.0;6.1;6.1+PTX
-- Could NOT find CUDNN (missing: CUDNN_INCLUDE_DIR CUDNN_LIBRARY)

You probably need this too. Either on your system or in conda.

-- Looking for LLVM in /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/lib/cmake/llvm
-- Found LLVM 5.0.0git-ec3ad2b
-- Using LLVMConfig.cmake in: /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/lib/cmake/llvm
-- Found PythonLibs: /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/lib/libpython3.6m.so
-- pybind11 v2.2.1
-- PYTHON output: /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/bin/python
-- IMPORTING TORCH: 0
-- PYTHON site packages: /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/lib/python3.6/site-packages
-- TORCH INSTALLED, linking to ATen from PyTorch
-- Found ATen.so file: ATEN_LIBRARIES-NOTFOUND

This one, you definitely need. If you installed the pytorch conda package, you should have one in /home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/lib/python3.6/site-packages/torch/lib/libATen.so

Do you?
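The check being asked for can be scripted. This is an illustrative sketch only: the site-packages layout and the `libATen.so.1` variant come from the logs in this thread, and the Python version under the conda prefix is an assumption.

```shell
# Illustrative check for the ATen shared library shipped by the
# pytorch conda package; both names seen in this thread are tried.
find_aten() {
  for name in libATen.so libATen.so.1; do
    candidate="$1/lib/python3.6/site-packages/torch/lib/$name"
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

find_aten "$CONDA_PREFIX" || echo "no libATen.so under $CONDA_PREFIX"
```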

Btw, how is this related to LAPACK? You may want to rename the issue or start a new one.

skimo

keightyfive commented 6 years ago

Hi, yes, sorry, the LAPACK issue was from when I tried building before the build system got a makeover. I changed the title now. I'm on a cluster with many different nodes with various NVIDIA GPUs (https://www.macs.hw.ac.uk/~hv15/robotarium/), and I'm not sure why the auto-detection failed - I'd have to ask the admin. I think we have Ubuntu running on the dgx-1 node, but a Red Hat distribution (Scientific Linux) running on the other nodes. I can also load CUDNN as a module... we have 5.1, 6.0 and 7.0 available. I can set the path in the .bashrc file too if necessary. I did install the pytorch conda package, the path is there... the file is called libATen.so.1. So now back to happy hacking...

Cheers, Kevin

ftynse commented 6 years ago

I copied this from the Slack channel:

Hi, I am trying the new build on a SLURM-based cluster (https://www.macs.hw.ac.uk/~hv15/robotarium/)... I'm having the following 3 issues: the automatic GPU detection fails, CUDNN cannot be found and ATen cannot be found. I set both the paths to CUDA and CUDNN in my .bashrc (export CUDA_TOOLKIT_ROOT_DIR="/cm/shared/modulefiles/cuda90/toolkit/9.0.176", export CUDNN_INCLUDE_DIR CUDNN_LIBRARY="/cm/shared/modulefiles/cudnn/6.0") and loaded them as modules, and the ATen file is clearly there too (/home/kklein/.apps/tc_0_1_1/anaconda/envs/tc_build/lib/python3.6/site-packages/torch/lib/libATen.so.1). In the snippet above you can see the full output.

ftynse commented 6 years ago

Architecture detection works by creating, compiling and executing a simple CUDA executable that queries the GPU properties. If it was not compiled for some reason or cannot be run for some reason, architecture detection won't work. Common reasons include an improperly configured or incomplete CUDA toolkit and wrong CMake flags. I can hazard a guess that your cudnn-related flags are wrong, and that CMake attempts to compile the detection file using those flags (for no reason), leading to an error.

The ${TC_DIR}/build directory should contain a file called detect_cuda_archs.cu. Try to compile it by hand, then run it and see if it prints something; that should be the arch.

skimo-openhub commented 6 years ago

On Wed, Jun 27, 2018 at 03:46:20PM +0000, Kevin Klein wrote:

Hi, yes, sorry, the LAPACK issue was from when I tried building before the build system got a makeover. I changed the title now. I'm on a cluster with many different nodes with various NVIDIA GPUs (https://www.macs.hw.ac.uk/~hv15/robotarium/), and I'm not sure about the CUDA card or why the auto-detection failed - I'd have to ask the admin. I think we have Ubuntu running on the dgx-1 node, but a Red Hat distribution (Scientific Linux) running on the other nodes. I can also load CUDNN as a module... we have 5.1, 6.0 and 7.0 available. I can set the path in the .bashrc file too if necessary. I did install the pytorch conda package, the path is there... the file is called libATen.so.1.

Hmm... which version of pytorch did you install? The TC CMakeLists.txt is expecting a file called "libATen.so". (I have pytorch-0.4.0-py36hdf912b8_0 and it seems pytorch-0.4.0-py36_cuda9.0.176_cudnn7.1.2_1 also has "libATen.so")

So now back to happy hacking...

Does this mean your problem has been solved?

skimo

ftynse commented 6 years ago

I can also load CUDNN as a module

This module loading (I suppose the modules come from easybuild, which is common on HPC systems) merely rewrites environment variables. So does conda, and I hope they don't clash. CMake is known to ignore or deprioritize paths from environment variables.

export CUDNN_INCLUDE_DIR CUDNN_LIBRARY="/cm/shared/modulefiles/cudnn/6.0"

These are CMake variables that it handles internally. Setting them in the shell has zero effect (unless we tell CMake to look for them, which we did not). Try CUDNN_ROOT_DIR=$CONDA_PREFIX if you installed cudnn with conda, or CUDNN_ROOT_DIR=<path-to-system-cudnn> if you did not.
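To make the distinction concrete: in a plain `cmake` invocation, such variables go into the cache via `-D` flags rather than shell exports (whereas TC's `build.sh` is what reads `CUDNN_ROOT_DIR` and similar from its environment). The helper below is illustrative only and just formats such flags; the values shown are examples from this thread.

```shell
# Illustrative helper: turn a CMake cache variable name and value into
# the -DNAME=VALUE form a direct cmake invocation expects.
as_cmake_flag() {
  printf '%s=%s\n' "-D$1" "$2"
}

as_cmake_flag CUDNN_ROOT_DIR "$CONDA_PREFIX"
as_cmake_flag CUDA_TOOLKIT_ROOT_DIR /cm/shared/modulefiles/cuda90/toolkit/9.0.176
```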