Closed amorehead closed 3 years ago
I should also mention that following up the cmake command listed above with the command "make -j4" generates the errors below:
... Build steps leading up to 39% ... (default at OPT(3)) has the potential to alter the semantics of a program. Please refer to documentation on the STRICT/NOSTRICT option for more information. [ 39%] Linking CXX static library libdmlc.a [ 39%] Built target dmlc make: *** [all] Error 2
Hi @amorehead, I am encountering a similar problem in the build process, but I am using a different compiler, namely GCC 7.3. In my case I initially get the same result as you, with the SSE2 test failing because SSE2 is specific to the Intel x86 architecture. The DGL git repository has an include folder that contains Intel-specific files, and I am wondering whether it is even possible to build the DGL library on PowerPC. Maybe some source code has to be adapted to make it work. Did you find a solution in the meantime?
Hi, @fxd24 . I have not been able to get this to build yet, no. I am hoping someone else who has figured this out before will notice this issue before too long.
You can set the option USE_AVX to OFF when building the code, either with cmake -DUSE_CUDA=ON -DUSE_AVX=OFF .. or by changing the option inside cmake/config.cmake
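As a rough illustration of why this flag matters on PowerPC: AVX is an x86 instruction-set extension, so a build wrapper could decide from the host architecture whether to pass -DUSE_AVX=OFF. The sketch below is only an assumption about how one might script that choice. The helper name `avx_cmake_flags` is hypothetical and not part of DGL; only the USE_AVX option itself comes from this thread.

```python
import platform

# Machine names that platform.machine() commonly reports for x86 hosts.
X86_MACHINES = {"x86_64", "amd64", "i386", "i686"}


def avx_cmake_flags(machine=None):
    """Return extra CMake flags for the given machine type.

    On x86 the DGL default is left untouched; on anything else
    (for example ppc64le) AVX is switched off explicitly.
    """
    machine = (machine or platform.machine()).lower()
    if machine in X86_MACHINES:
        return []
    return ["-DUSE_AVX=OFF"]


if __name__ == "__main__":
    # Print the configure command this host would use.
    print(" ".join(["cmake", "-DUSE_CUDA=ON", *avx_cmake_flags(), ".."]))
```

On a Power9 node, where `platform.machine()` reports `ppc64le`, this prints the same command as suggested above.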
@VoVAllen After running cmake in the "build" directory with the command you suggested above (i.e. cmake -DUSE_CUDA=ON -DUSE_AVX=OFF ..), I was presented with the following error after running its corresponding make command (i.e. make -j4):
[ 37%] Building C object third_party/METIS/libmetis/CMakeFiles/metis.dir/timing.c.o
[ 37%] Building C object third_party/METIS/libmetis/CMakeFiles/metis.dir/util.c.o
[ 38%] Building C object third_party/METIS/libmetis/CMakeFiles/metis.dir/wspace.c.o
[ 39%] Linking C static library libmetis.a
[ 39%] Built target metis
1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program. Please refer to documentation on the STRICT/NOSTRICT option for more information.
[ 39%] Linking CXX static library libdmlc.a
[ 39%] Built target dmlc
make: *** [all] Error 2
After running the same make command above with VERBOSE=1, I see the following errors:
make[2]: Leaving directory `/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Lab_Repositories/dgl/build'
Re-run cmake no build system arguments
[ 39%] Built target metis
-- The C compiler identification is XLClang 16.1.1.5
-- The CXX compiler identification is XLClang 16.1.1.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /sw/summit/xl/16.1.1-5/xlC/16.1.1/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /sw/summit/xl/16.1.1-5/xlC/16.1.1/bin/xlC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Using Python interpreter: python
Traceback (most recent call last):
File "/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/dgl/tensoradapter/pytorch/find_cmake.py", line 1, in
CMake Error at CMakeLists.txt:17 (list): list GET given empty list
-- Configuring for PyTorch
-- Setting directory to /Torch
CMake Error at CMakeLists.txt:22 (find_package):
By not providing "FindTorch.cmake" in CMAKE_MODULE_PATH this project has asked CMake to find a package configuration file provided by "Torch", but CMake did not find one.
Could not find a package configuration file provided by "Torch" with any of the following names:
TorchConfig.cmake
torch-config.cmake
Add the installation prefix of "Torch" to CMAKE_PREFIX_PATH or set "Torch_DIR" to a directory containing one of the above files. If "Torch" provides a separate development package or SDK, be sure it has been installed.
-- Configuring incomplete, errors occurred!
See also "/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/dgl/tensoradapter/pytorch/build/CMakeFiles/CMakeOutput.log".
make[2]: [CMakeFiles/tensoradapter_pytorch] Error 1
make[2]: Leaving directory `/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Lab_Repositories/dgl/build'
make[1]: [CMakeFiles/tensoradapter_pytorch.dir/all] Error 2
make[1]: Leaving directory `/gpfs/alpine/bip198/scratch/acmwhb/Repositories/Lab_Repositories/dgl/build'
make: *** [all] Error 2
I believe the reason for the above error is that I am not compiling DGL in a Conda environment. By default, only Python 2 is installed in the default environment (i.e. outside of any Conda or pip environment - that is, globally).
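To check this diagnosis, one can look at which interpreter the bare `python` command resolves to and what version it reports, since the log above shows the build using "python" as its interpreter. The following is a minimal sketch; the helper names `parse_python_version` and `default_python_version` are made up for illustration.

```python
import re
import shutil
import subprocess


def parse_python_version(text):
    """Extract (major, minor) from output such as 'Python 2.7.5'."""
    match = re.search(r"Python (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def default_python_version(command="python"):
    """Return (path, version) for the interpreter `command` resolves to."""
    path = shutil.which(command)
    if path is None:
        return None, None
    # Python 2 prints its version to stderr, Python 3 to stdout,
    # so both streams are inspected.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    return path, parse_python_version(result.stdout + result.stderr)


if __name__ == "__main__":
    print(default_python_version())
```

If the reported major version is 2, or the path lies outside the environment that contains PyTorch, that would be consistent with the explanation given here.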
Torch is optional; when present, it can accelerate memory allocation in DGL. To build without Torch, you can try cmake -DBUILD_TORCH=OFF -DUSE_CUDA=ON -DUSE_AVX=OFF ..
Thanks @VoVAllen ! The AVX flag did it! Here are my steps @amorehead :
Before you begin, create a conda environment: conda create -n ENV_NAME python=3.7
Installing DGL requires a few dependencies that may add some extra steps to the installation process.
We require a gcc compiler of version >= 5.x.x.
To install a newer gcc compiler within the Conda environment, type the following:
conda install cudatoolkit-dev gxx_linux-ppc64le=7
Then clone the git repo of dgl. See https://docs.dgl.ai/install/index.html. (I didn't use the config.cmake.) After the cmake files are created, we are not yet ready to build with make, because we first have to point the build at the newly installed compiler by changing the following entries in the CMakeCache.txt file in the build folder (note: the file reverts to the defaults when cmake is executed again):
//CMAKE_CXX_COMPILER:FILEPATH=/usr/bin/c++
CMAKE_CXX_COMPILER:FILEPATH=.conda/envs/<ENV_NAME>/bin/powerpc64le-conda_cos7-linux-gnu-c++
...
//C compiler
CMAKE_C_COMPILER:FILEPATH=.conda/envs/<ENV_NAME>/bin/powerpc64le-conda_cos7-linux-gnu-cc
We also have to disable the AVX optimizations, as they only work on the x86 architecture unless some kind of emulation or mapping is used. Therefore, set the following option to OFF.
//Build with AVX optimization
USE_AVX:STRING=OFF
Note that the file path may be different on your cluster.
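An alternative to editing CMakeCache.txt by hand, which I have not verified on this cluster, is to pass the compilers on the cmake command line through the standard CMAKE_C_COMPILER and CMAKE_CXX_COMPILER variables when configuring a fresh build directory. The sketch below only assembles such a command. The compiler file names are the ones quoted above, while the helper name `conda_compiler_flags` and the example prefix are hypothetical.

```python
from pathlib import PurePosixPath

# Compiler executables as named in the CMakeCache.txt entries above.
C_COMPILER = "powerpc64le-conda_cos7-linux-gnu-cc"
CXX_COMPILER = "powerpc64le-conda_cos7-linux-gnu-c++"


def conda_compiler_flags(env_prefix):
    """Build the -D flags that point CMake at a conda environment's compilers."""
    bin_dir = PurePosixPath(env_prefix) / "bin"
    return [
        f"-DCMAKE_C_COMPILER={bin_dir / C_COMPILER}",
        f"-DCMAKE_CXX_COMPILER={bin_dir / CXX_COMPILER}",
    ]


if __name__ == "__main__":
    # Hypothetical prefix; substitute the path of your own environment.
    flags = conda_compiler_flags("/home/user/.conda/envs/ENV_NAME")
    print(" ".join(["cmake", *flags, "-DUSE_CUDA=ON", "-DUSE_AVX=OFF", ".."]))
```

Passing USE_AVX=OFF the same way also avoids the second manual cache edit.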
Then we have another problem caused by the compilation process using -march=native, which is not supported on PowerPC and has to be switched to -mcpu=native.
Therefore we have to change the flags in each of the files causing the problem:
library/dgl/build/third_party/METIS/libmetis/CMakeFiles/metis.dir/flags.make
and replace -march=native with -mcpu=native. Finally, you can type make -j4
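Since the flags.make files are regenerated whenever cmake runs, the replacement can be scripted rather than done in an editor. This is a minimal sketch under the assumption that every affected file is named flags.make and lives somewhere under the build directory. The function names are hypothetical.

```python
from pathlib import Path


def fix_march_flag(text):
    """Replace the x86-style -march=native with -mcpu=native."""
    return text.replace("-march=native", "-mcpu=native")


def patch_flags_files(build_dir):
    """Rewrite each flags.make under build_dir that contains -march=native.

    Returns the list of files that were actually changed.
    """
    patched = []
    for path in Path(build_dir).rglob("flags.make"):
        original = path.read_text()
        updated = fix_march_flag(original)
        if updated != original:
            path.write_text(updated)
            patched.append(path)
    return patched


if __name__ == "__main__":
    # Run from inside the build directory, after cmake and before make.
    for changed in patch_flags_files("."):
        print("patched", changed)
```

Running it after cmake and before make -j4 mirrors the manual edit described above.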
Thank you for sharing, @fxd24 ! Since I do not have the permissions on the cluster I am using to install Cuda in a Conda environment, I have been running the instructions from https://docs.dgl.ai/install/index.html with the modifications I've listed above. This time, I also tried editing the METIS package's flags.make file you mentioned to have the "-MCPU=NATIVE" flag appended to the "C_FLAGS = ..." variable, and I am still encountering the following errors in building METIS:
[ 39%] Built target metis
-- The C compiler identification is XLClang 16.1.1.5
-- The CXX compiler identification is XLClang 16.1.1.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /sw/summit/xl/16.1.1-5/xlC/16.1.1/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /sw/summit/xl/16.1.1-5/xlC/16.1.1/bin/xlC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Using Python interpreter: python
-- find_cmake.py output: /gpfs/alpine/bip198/scratch/acmwhb/Repositories/Lab_Repositories/RGSET/venv/lib/python3.6/site-packages/torch/share/cmake;1.6.0a0
-- Configuring for PyTorch 1.6.0a0
-- Setting directory to /gpfs/alpine/bip198/scratch/acmwhb/Repositories/Lab_Repositories/RGSET/venv/lib/python3.6/site-packages/torch/share/cmake/Torch
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
CMake Warning at /gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/RGSET/venv/lib/python3.6/site-packages/torch/share/cmake/Caffe2/public/protobuf.cmake:88 (message):
Protobuf cannot be found. Depending on whether you are building Caffe2 or
a Caffe2 dependent library, the next warning / error will give you more
info.
Call Stack (most recent call first):
/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/RGSET/venv/lib/python3.6/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:56 (include)
/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/RGSET/venv/lib/python3.6/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
CMakeLists.txt:22 (find_package)
CMake Error at /gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/RGSET/venv/lib/python3.6/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:58 (message):
Your installed Caffe2 version uses protobuf but the protobuf library cannot be found. Did you accidentally remove it, or have you set the right CMAKE_PREFIX_PATH? If you do not have protobuf, you will need to install protobuf and set the library path accordingly.
Call Stack (most recent call first):
/gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/RGSET/venv/lib/python3.6/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:40 (find_package)
CMakeLists.txt:22 (find_package)
@amorehead You can add -DBUILD_TORCH=OFF to your build options, which will skip the Torch/Caffe checking. Or you can specify the conda Python path that contains torch by adding -DPYTHON_INTERP=#Your conda python path
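The choice between the two options can be sketched as a small check: if a given interpreter can import torch, pass it via -DPYTHON_INTERP, otherwise fall back to -DBUILD_TORCH=OFF. Both flags are the ones named above. The helper `torch_cmake_flag` is a hypothetical name, and whether a successful import is a sufficient test for the Torch build step is an assumption.

```python
import subprocess
import sys


def torch_cmake_flag(python_path):
    """Pick a CMake flag depending on whether python_path can import torch."""
    try:
        result = subprocess.run(
            [python_path, "-c", "import torch"],
            capture_output=True,
        )
        has_torch = result.returncode == 0
    except OSError:
        # The interpreter path does not exist or cannot be executed.
        has_torch = False
    if has_torch:
        return f"-DPYTHON_INTERP={python_path}"
    return "-DBUILD_TORCH=OFF"


if __name__ == "__main__":
    # Check the interpreter running this script.
    print(torch_cmake_flag(sys.executable))
```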
Through a combination of collaborative efforts, I was finally able to get DGL compiled properly on my PowerPC cluster! Thank you all for your help and guidance. Here are the exact steps I followed to get it installed as a Python dependency in a Conda environment:
(On my cluster's login node - with no Conda/venv environments activated)
Hey, everyone. I am encountering a new error when I try to run "make -j4" for DGL 0.7 on a Power9 (PowerPC) architecture (e.g., ORNL's Summit system). In the DGL Slack channel, @BarclayII suggested I use the USE_LIBXSMM=OFF flag for CMake to ignore LIBXSMM since it does not (currently) support the PowerPC architecture. I have updated my commands above to reflect this approach.
However, even while I can get around the above error with the USE_LIBXSMM=OFF flag, I am now encountering another error when I run a Python script that simply imports DGL.
Any ideas as to what's missing in my build script to get this C library showing up on my path?
@amorehead This does not seem to be related to DGL. This error usually means you built on a system with a higher glibc version and are running on a machine with a lower glibc version. Could you provide more details about your build environment?
@VoVAllen, This error seems strange, because I am building DGL on Summit (a Power9/PowerPC GPU server) and then immediately going to test it in a Python script on Summit (same environment as the build environment, with the same HPC modules loaded). I am building DGL exactly as I have outlined above, and once the Python bindings are installed in my local Conda environment, I go to test DGL in a Python script and am greeted with the above "version GLIBCXX_3.4.26 not found" error.
To see which version of GLIB is available on Summit, I ran "module spider GLIB" to see these results.
It looks like the platform only has version 2.66.2 of GLIB available to users. Do you know if version 3 of GLIB became the default in versions 0.7 and 0.8 of DGL? I had DGL working just fine on Summit with version 0.6.
Could you try ldd libdgl.so to see which libc.so it resolves to? Are you using a conda environment? Conda might change the RPATH of the dynamic library and other environment variables. Could you try compiling and running in the conda env?
@VoVAllen, the results of my "ldd libdgl.so" are as follows:
(DeepInteract)[acmwhb@login1.summit DeepInteract]$ ldd /ccs/home/acmwhb/.conda/envs/DeepInteract/lib/python3.8/site-packages/dgl-0.7.0-py3.8-linux-ppc64le.egg/dgl/libdgl.so
linux-vdso64.so.1 (0x00007fffb9510000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007fffb4690000)
librt.so.1 => /lib64/power9/librt.so.1 (0x00007fffb4660000)
libcublas.so.11 => /sw/summit/cuda/11.0.3/lib64/libcublas.so.11 (0x00007fffae840000)
libcusparse.so.11 => /sw/summit/cuda/11.0.3/lib64/libcusparse.so.11 (0x00007fffa4f40000)
libcurand.so.10 => /sw/summit/cuda/11.0.3/lib64/libcurand.so.10 (0x00007fffa05a0000)
libpthread.so.0 => /lib64/power9/libpthread.so.0 (0x00007fffa0550000)
libgomp.so.1 => /sw/summit/gcc/9.3.0-2/lib64/libgomp.so.1 (0x00007fffa04e0000)
libstdc++.so.6 => /sw/summit/gcc/9.3.0-2/lib64/libstdc++.so.6 (0x00007fffa0250000)
libm.so.6 => /lib64/power9/libm.so.6 (0x00007fffa0120000)
libgcc_s.so.1 => /sw/summit/gcc/9.3.0-2/lib64/libgcc_s.so.1 (0x00007fffa00e0000)
libc.so.6 => /lib64/power9/libc.so.6 (0x00007fff9fed0000)
/lib64/ld64.so.2 (0x00007fffb9530000)
libcublasLt.so.11 => /sw/summit/cuda/11.0.3/lib64/libcublasLt.so.11 (0x00007fff95180000)
I also tried to compile DGL from source inside the Conda environment in which DGL is ultimately being installed. It did not seem to affect the overall result: when I run a simple Python script inside my Conda environment and try to import dgl, it still complains that "version GLIBCXX_3.4.26 not found".
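When comparing environments, it can help to turn ldd output into a name-to-path mapping, so that the libstdc++.so.6 seen from the shell can be compared with the one visible inside the Conda environment. Note that ldd reports what the loader would pick in the current shell, which may differ from what a running Python process has loaded. The sketch below is a plain text parser, and the function name `parse_ldd` is hypothetical.

```python
import re

# Matches lines such as:
#   libstdc++.so.6 => /path/to/libstdc++.so.6 (0x00007fffa0250000)
#   libfoo.so => not found
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(.+?)(?:\s+\(0x[0-9a-f]+\))?\s*$")


def parse_ldd(output):
    """Map each library name in ldd output to the path it resolved to."""
    resolved = {}
    for line in output.splitlines():
        match = LDD_LINE.match(line)
        if match:
            resolved[match.group(1)] = match.group(2)
    return resolved


if __name__ == "__main__":
    sample = (
        "linux-vdso64.so.1 (0x00007fffb9510000)\n"
        "libstdc++.so.6 => /sw/summit/gcc/9.3.0-2/lib64/libstdc++.so.6 (0x00007fffa0250000)\n"
    )
    print(parse_ldd(sample)["libstdc++.so.6"])
```

Running ldd with and without the Conda environment activated and diffing the two mappings would show whether libstdc++.so.6 resolves differently.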
I tried multiple versions of DGL as well as of GCC. It looks like DGL only recommends using up to GCC 9 for newer builds of the library. Another thought that came to mind: there are some references to Power8 in the METIS flags.make file that CMake generates.
# CMAKE generated file: DO NOT EDIT!
# Generated by "Unix Makefiles" Generator, CMake Version 3.20
# compile C with /sw/summit/gcc/9.3.0-2/bin/gcc
C_DEFINES = -DDGL_USE_CUDA -DENABLE_PARTIAL_FRONTIER=0
C_INCLUDES = -I/sw/summit/cuda/11.0.3/include -I/gpfs/alpine/scratch/acmwhb/bif132/Repositories/Intermediate_Repositories/dgl0.7/third_party/METIS/GKlib -I/gpfs/alpine/scratch/acmwhb/bif132/Repositories/Intermediate_Repositories/dgl0.7/third_party/METIS/include -I/gpfs/alpine/scratch/acmwhb/bif132/Repositories/Intermediate_Repositories/dgl0.7/third_party/METIS/libmetis/.
C_FLAGS = -fopenmp -O2 -Wall -fPIC -mcpu=power8 -mtune=power8 -mpower8-fusion -mpower8-vector -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O3 -pipe -DIDXTYPEWIDTH=64 -DREALTYPEWIDTH=32 -DLINUX -D_FILE_OFFSET_BITS=64 -std=c99 -fno-strict-aliasing -march=native -fPIC -Werror -Wall -pedantic -Wno-unused-function -Wno-unused-but-set-variable -Wno-unused-variable -Wno-unknown-pragmas -DNDEBUG -DNDEBUG2 -DHAVE_EXECINFO_H -DHAVE_GETLINE -O3
Before Summit had a large OS upgrade (from RHEL 7 to RHEL 8) and with version 0.6 of DGL, I only had to replace the -march=native flag with the -mcpu=native flag to get it to work on Summit. However, after the Summit upgrade and with newer DGL versions, it looks like CMake is now populating these new -mtune and -mpower flags with power8 values. Do you think these may cause any issues (since we are technically working with a Power9 architecture)? When these started showing up, I defaulted to removing "-march=native" to leave "-mcpu=power8".
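To see at a glance which architecture-related options end up in a generated flags.make, the -m* options can be pulled out of the C_FLAGS line. This sketch does not answer whether the power8 values are harmful on Power9. It only makes them easy to inspect. One possibility, not confirmed in this thread, is that they come from a CFLAGS environment variable, so the sketch prints that as well. The helper name `machine_flags` is hypothetical.

```python
import os
import shlex


def machine_flags(flags_line):
    """Return the -m* options from a 'C_FLAGS = ...' line, in order."""
    # Everything after the first '=' is the flag list itself.
    _, _, value = flags_line.partition("=")
    return [token for token in shlex.split(value) if token.startswith("-m")]


if __name__ == "__main__":
    line = "C_FLAGS = -fopenmp -O2 -mcpu=power8 -mtune=power8 -march=native -fPIC"
    print(machine_flags(line))
    # Show whether the shell environment already carries compiler flags.
    print("CFLAGS =", os.environ.get("CFLAGS"))
```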
@amorehead Sorry for the late reply. Basically, we don't have any restriction on glibc. Could you try to find out exactly which library depends on the higher version of glibc, for example by checking the library dependencies at runtime?
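One way to narrow this down: the GLIBCXX_3.4.26 tag in the error message is a symbol version provided by libstdc++ (which ships with GCC), so each candidate libstdc++.so.6 can be scanned for the version tags it contains, much like `strings libstdc++.so.6 | grep GLIBCXX`. The following is a minimal sketch with hypothetical function names. It searches the raw bytes rather than parsing the ELF version tables, which is a simplification.

```python
import re

# Version tags appear as plain ASCII strings inside the library file.
GLIBCXX_TAG = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def glibcxx_versions(data):
    """Return the sorted, de-duplicated GLIBCXX versions found in raw bytes."""
    found = set()
    for match in GLIBCXX_TAG.finditer(data):
        parts = match.group(1).decode("ascii").split(".")
        found.add(tuple(int(part) for part in parts))
    return sorted(found)


def provides(path, wanted=(3, 4, 26)):
    """Check whether the library at path contains the wanted version tag."""
    with open(path, "rb") as handle:
        return wanted in glibcxx_versions(handle.read())


if __name__ == "__main__":
    fake = b"\x00GLIBCXX_3.4\x00GLIBCXX_3.4.26\x00"
    print(glibcxx_versions(fake))
```

Calling `provides` on both the system or module libstdc++.so.6 and any copy inside the Conda environment would show which one lacks the requested version.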
I noticed that you were using an HPC. What are the modules you have loaded? I believe one of the modules you loaded has a higher GLIBC.
@VoVAllen and @BarclayII, thank you for your thoughtful replies. @BarclayII, below is the output of my running "module list" on the HPC cluster of interest:
(DeepInteract)[acmwhb@login3.summit DeepInteract]$ module list
Currently Loaded Modules:
1) lsf-tools/2.0 2) hsi/5.0.2.p5 3) darshan-runtime/3.3.0-lite 4) xalt/1.2.1 5) DefApps 6) open-ce/1.2.0-py38-0 7) gcc/9.3.0 8) cmake/3.20.2 9) spectrum-mpi/10.4.0.3-20210112 10) cuda/11.0.3
This follows my build instructions up above, where I load in CMake, CUDA, GCC, and open-ce (a distributed deep learning module specific to the cluster I am running on - https://github.com/open-ce/open-ce).
Is there any more information you would find relevant for troubleshooting which library requires a higher version of glibc?
Could you take a look at these environments and see if they introduce a newer GLIBC? Probably one of the modules changed RPATH or LIBRARY_PATH, so the paths are being messed up.
LIBRARY_PATH is used for linking at compile time; LD_LIBRARY_PATH is used for locating shared libraries at run time.
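To see what each variable currently offers, the directories on both paths can be scanned for libstdc++.so.6. When several directories on LD_LIBRARY_PATH contain the same library, the earlier entry is generally the one found first, so the order of hits matters. This is a sketch with a hypothetical helper name, and it ignores RPATH/RUNPATH and the system loader cache.

```python
import os
from pathlib import Path


def find_on_search_path(filename, search_path):
    """List, in order, every entry on a search path that contains filename."""
    hits = []
    for entry in search_path.split(os.pathsep):
        candidate = Path(entry) / filename
        # Empty entries are skipped; they arise from leading or doubled separators.
        if entry and candidate.exists():
            hits.append(str(candidate))
    return hits


if __name__ == "__main__":
    for variable in ("LIBRARY_PATH", "LD_LIBRARY_PATH"):
        value = os.environ.get(variable, "")
        print(variable, find_on_search_path("libstdc++.so.6", value))
```

Running it once after loading the modules and once after activating the Conda environment shows whether either step changes which copy is visible.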
Closing the issue for now. Feel free to reopen.
Hi all, I seem to be running into similar errors. However, none of the above comments have worked so far.
I am running
conda activate ENV
module load profile/deeplrn autoload hpc/2.2.0 (loads pytorch/cuda 11.0 and other deep-learning-related packages)
module load cmake
module load gnu
git clone --depth 1 --branch 0.6.x https://github.com/dmlc/dgl.git (for installing DGL 0.6) or git clone https://github.com/dmlc/dgl.git (for installing the latest release of DGL)
cd dgl/
git submodule update --init --recursive
mkdir build
cp cmake/config build/
cd build/
cmake -DUSE_AVX=OFF -DUSE_CUDA=ON -DUSE_LIBXSMM=OFF -DBUILD_TORCH=OFF ..
nano third_party/METIS/libmetis/CMakeFiles/metis.dir/flags.make (to replace '-march=native' with '-mcpu=native', followed by writing changes to storage and exiting file)
make -j4
I get the following error:
CMake Error at /hpc/prod/opt/libraries/hpc-ai/2.2.0/none/hpc-ai-conda-env-py3.8-cuda-openmpi-11.0/lib/python3.8/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:58 (message):
Your installed Caffe2 version uses protobuf but the protobuf library cannot be found. Did you accidentally remove it, or have you set the right CMAKE_PREFIX_PATH? If you do not have protobuf, you will need to install protobuf and set the library path accordingly.
Call Stack (most recent call first):
/hpc/prod/opt/libraries/hpc-ai/2.2.0/none/hpc-ai-conda-env-py3.8-cuda-openmpi-11.0/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:22 (find_package
If I don't copy the config from cmake when creating the build directory, I get this instead: CMake Error at dgl_generated_array_nonzero.cu.o.cmake:276 (message): Error generating file /m100/home/[username]/dgl/build/CMakeFiles/dgl.dir/src/array/cuda/./dgl_generated_array_nonzero.cu.o
Any ideas?
❓ PowerPC (Power9) Source Compilation
Hello. I have recently been trying to compile DGL from source on a Power9 (PowerPC) Linux-based cluster, and I am not having much luck. The steps I have taken to try to compile it from source are as follows:
Running step 8 results in:
-- The C compiler identification is XLClang 16.1.1.5
-- The CXX compiler identification is XLClang 16.1.1.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /sw/summit/xl/16.1.1-5/xlC/16.1.1/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /sw/summit/xl/16.1.1-5/xlC/16.1.1/bin/xlC - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Start configuring project dgl
-- Performing Test SUPPORT_CXX11
-- Performing Test SUPPORT_CXX11 - Success
-- Found OpenMP_C: -qsmp=omp (found version "4.5")
-- Found OpenMP_CXX: -qsmp=omp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Build with OpenMP.
-- Build with AVX optimization.
CMake Warning (dev) at third_party/dmlc-core/cmake/Utils.cmake:196 (option):
Policy CMP0077 is not set: option() honors normal variables. Run "cmake --help-policy CMP0077" for policy details. Use the cmake_policy command to set the policy and suppress this warning.
For compatibility with older versions of CMake, option is clearing the normal variable 'USE_OPENMP'.
Call Stack (most recent call first):
third_party/dmlc-core/CMakeLists.txt:20 (dmlccore_option)
This warning is for project developers. Use -Wno-dev to suppress it.
-- Found OpenMP_C: -qsmp=omp (found version "4.5")
-- Found OpenMP_CXX: -qsmp=omp (found version "4.5")
-- Looking for clock_gettime in rt
-- Looking for clock_gettime in rt - found
-- Looking for fopen64
-- Looking for fopen64 - not found
-- Looking for C++ include cxxabi.h
-- Looking for C++ include cxxabi.h - found
-- Looking for nanosleep
-- Looking for nanosleep - found
-- Looking for backtrace
-- Looking for backtrace - found
-- backtrace facility detected in default set of libraries
-- Found Backtrace: /usr/include
-- Check if the system is big endian
-- Searching 16 bit integer
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of unsigned short
-- Check size of unsigned short - done
-- Searching 16 bit integer - Using unsigned short
-- Check if the system is big endian - little endian
-- /gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/dgl/third_party/dmlc-core/cmake/build_config.h.in -> include/dmlc/build_config.h
-- Performing Test SUPPORT_MSSE2
-- Performing Test SUPPORT_MSSE2 - Failed
-- Looking for execinfo.h
-- Looking for execinfo.h - found
-- Looking for getline
-- Looking for getline - found
-- Configuring done
-- Generating done
-- Build files have been written to: /gpfs/alpine/scratch/acmwhb/bip198/Repositories/Lab_Repositories/dgl/build
It looks like the build script fails when testing for MSSE2 support, and I am not sure where to go from here. I would appreciate any advice you might have to offer!