TomMelt closed this issue 1 year ago
Hmm, it seems that CMake is picking up an MPI installation in your environment other than the NVHPC one, and that installation may not have Fortran support: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/openmpi-4.1.5-eq5qt6oay5atbk4jff6f5fg6tfmugwsp/lib/libmpi.so (found version "3.1")
I wonder if it might be due to an extraneous call to find_package(MPI REQUIRED) in our CMakeLists.txt here: https://github.com/NVIDIA/TorchFort/blob/e06613d6feccc3d11c166f146abce7abdd85f1b3/CMakeLists.txt#L24
Can you try commenting that out of the CMakeLists.txt file and see if that resolves this issue?
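As a rough sketch of that suggestion, the call can be commented out with sed instead of by hand. The snippet works on a scratch file so it can be run anywhere; the two-line CMakeLists.txt it writes is a stand-in for illustration, not the real TorchFort file, and sed -i assumes GNU sed.

```shell
# Scratch stand-in; the real target would be TorchFort's top-level CMakeLists.txt.
workdir=$(mktemp -d)
cat > "$workdir/CMakeLists.txt" <<'EOF'
find_package(Torch REQUIRED)
find_package(MPI REQUIRED)
EOF

# Prefix the MPI lookup with '#' so CMake skips it (GNU sed in-place edit).
sed -i 's/^find_package(MPI REQUIRED)/# &/' "$workdir/CMakeLists.txt"

# Show the result: the Torch line is untouched, the MPI line is commented out.
cat "$workdir/CMakeLists.txt"
```

Commenting the line out rather than deleting it keeps the change easy to revert if it turns out not to be the cause.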
I have managed to get past the previous error by editing CMakeLists.txt as suggested, and by building hdf5 with nvhpc (previously it was built with gfortran, which linked to a separate build of openmpi).
But now it fails at the following point:
-- The CXX compiler identification is NVHPC 23.7.0
-- The Fortran compiler identification is NVHPC 23.7.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvc++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvfortran - skipped
-- Found CUDA: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd (found version "11.8")
-- The CUDA compiler identification is NVIDIA 11.8.89
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd/bin/nvcc
-- Caffe2: CUDA toolkit directory: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd
-- Caffe2: Header version is: 11.8
-- /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/cuda-11.8.0-dmxquapj2bbxtifgzf3fwl423bjh3qjd/lib64/libnvrtc.so shorthash is 672ee683
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_80,code=sm_80
CMake Warning at /home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:23 (find_package)
-- Found Torch: /home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/lib/libtorch.so
-- CUDA version selected: 11.8
-- Found MPI_CXX: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI_Fortran: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_usempif08.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Found HDF5: hdf5::hdf5_fortran-shared (found version "1.8.21") found components: Fortran
-- Found Python: /home/user/miniconda3/envs/torchfort/bin/python3.11 (found suitable version "3.11.4", minimum required is "3.6") found components: Interpreter Development Development.Module Development.Embed
-- Found pybind11: /home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/pybind11/include (found version "2.11.1")
CMake Warning (dev) at /home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/pybind11/share/cmake/pybind11/pybind11NewTools.cmake:220 (if):
Policy CMP0057 is not set: Support new IN_LIST if() operator. Run "cmake
--help-policy CMP0057" for policy details. Use the cmake_policy command to
set the policy and suppress this warning.
IN_LIST will be interpreted as an operator when the policy is set to NEW.
Since the policy is not set the OLD behavior will be used.
Call Stack (most recent call first):
examples/cpp/cart_pole/CMakeLists.txt:19 (pybind11_add_module)
This warning is for project developers. Use -Wno-dev to suppress it.
CMake Error at /home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/pybind11/share/cmake/pybind11/pybind11NewTools.cmake:220 (if):
if given arguments:
"NOT" "ARG_WITHOUT_SOABI" "AND" "NOT" "WITH_SOABI" "IN_LIST" "ARG_UNPARSED_ARGUMENTS"
Unknown arguments specified
Call Stack (most recent call first):
examples/cpp/cart_pole/CMakeLists.txt:19 (pybind11_add_module)
-- Configuring incomplete, errors occurred!
Do you have any ideas why it is failing?
Based on the error messages, it seems like this error has to do with the CMP0057 CMake policy not being set, so the OLD behavior is used. Can you add the line:
cmake_policy(SET CMP0057 NEW)
to the top of CMakeLists.txt and see if that resolves this issue?
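One placement detail, stated here as general CMake behavior rather than something confirmed in this thread: cmake_minimum_required() resets policy settings, so a cmake_policy(SET ...) line normally has to come after it to take effect. The sketch below inserts the line in that position in a scratch file; the file contents and the version number in it are hypothetical, and the one-line `a` command assumes GNU sed.

```shell
# Scratch stand-in for a top-level CMakeLists.txt (version number is hypothetical).
workdir=$(mktemp -d)
cat > "$workdir/CMakeLists.txt" <<'EOF'
cmake_minimum_required(VERSION 3.1)
project(demo LANGUAGES CXX)
EOF

# Append the policy line directly after cmake_minimum_required(...).
sed -i '/^cmake_minimum_required/a cmake_policy(SET CMP0057 NEW)' "$workdir/CMakeLists.txt"

# Show the edited file; the policy line should now be line 2.
cat "$workdir/CMakeLists.txt"
```

A command-line alternative that avoids editing the file is passing -DCMAKE_POLICY_DEFAULT_CMP0057=NEW to cmake, which CMake documents as a way to give an unset policy a default; whether that is sufficient for this project was not tested in the thread.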
I have managed to configure the project by editing CMakeLists.txt as suggested, building hdf5 with nvhpc (previously it was built with gfortran, which linked to a separate build of openmpi), and adding cmake_policy(SET CMP0057 NEW) to the CMakeLists.txt as suggested.
I get the following output from CMake:
-- The CXX compiler identification is NVHPC 23.7.0
-- The Fortran compiler identification is NVHPC 23.7.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvc++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting Fortran compiler ABI info
-- Detecting Fortran compiler ABI info - done
-- Check for working Fortran compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvfortran - skipped
-- Found CUDA: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7//cuda/11.8/ (found version "11.8")
-- The CUDA compiler identification is NVIDIA 12.2.91
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.8
-- Caffe2: CUDA nvcc is: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/cuda/11.8/bin/nvcc
-- Caffe2: CUDA toolkit directory: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7//cuda/11.8/
-- Caffe2: Header version is: 11.8
-- /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/cuda/11.8/lib64/libnvrtc.so shorthash is 672ee683
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_80,code=sm_80
CMake Warning at /home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
CMakeLists.txt:25 (find_package)
-- Found Torch: /home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/lib/libtorch.so
-- CUDA version selected: 11.8
-- Found MPI_CXX: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_cxx.so (found version "3.1")
-- Found MPI_Fortran: /software/spack/opt/spack/linux-ubuntu22.04-skylake/gcc-12.3.0/nvhpc-23.7-tdmi4llgnphtlarpvqggtvjukvvnr42w/Linux_x86_64/23.7/comm_libs/openmpi/openmpi-3.1.5/lib/libmpi_usempif08.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Found HDF5: hdf5::hdf5_fortran-shared (found version "1.8.21") found components: Fortran
-- Found Python: /home/user/miniconda3/envs/torchfort/bin/python3.11 (found suitable version "3.11.4", minimum required is "3.6") found components: Interpreter Development Development.Module Development.Embed
-- Found pybind11: /home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/pybind11/include (found version "2.11.1")
-- Configuring done (4.5s)
-- Generating done (0.0s)
-- Build files have been written to: /home/user/sync/projects/side/TorchFort/build
However, now I get the following error when trying to compile with make:
[ 2%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/distributed.cpp.o
"/home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/include/torch/csrc/profiler/util.h", line 133: error: identifier "__rdtsc" is undefined
return static_cast<uint64_t>(__rdtsc());
^
"/home/user/sync/projects/side/TorchFort/src/csrc/distributed.cpp", line 56: warning: statement is unreachable [code_is_unreachable]
CHECK_MPI(MPI_Comm_rank(mpi_comm, &rank));
^
Remark: individual warnings can be suppressed with "--diag_suppress <warning-name>"
"/home/user/sync/projects/side/TorchFort/src/csrc/distributed.cpp", line 57: warning: statement is unreachable [code_is_unreachable]
CHECK_MPI(MPI_Comm_size(mpi_comm, &size));
^
"/home/user/sync/projects/side/TorchFort/src/csrc/distributed.cpp", line 62: warning: statement is unreachable [code_is_unreachable]
CHECK_MPI(MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, mpi_comm));
^
"/home/user/sync/projects/side/TorchFort/src/csrc/distributed.cpp", line 71: warning: statement is unreachable [code_is_unreachable]
CHECK_MPI(MPI_Comm_rank(mpi_comm, &rank));
^
"/home/user/sync/projects/side/TorchFort/src/csrc/distributed.cpp", line 72: warning: statement is unreachable [code_is_unreachable]
CHECK_MPI(MPI_Comm_size(mpi_comm, &size));
^
"/home/user/sync/projects/side/TorchFort/src/csrc/distributed.cpp", line 126: warning: statement is unreachable [code_is_unreachable]
CHECK_MPI(MPI_Allreduce(MPI_IN_PLACE, &val, 1, MPI_DOUBLE, MPI_SUM, mpi_comm));
^
"/home/user/sync/projects/side/TorchFort/src/csrc/distributed.cpp", line 132: warning: statement is unreachable [code_is_unreachable]
CHECK_MPI(MPI_Allreduce(MPI_IN_PLACE, &val, 1, MPI_FLOAT, MPI_SUM, mpi_comm));
^
"/home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/include/c10/util/TypeIndex.h", line 190: error: expression must have a constant value
string_view name = detail::fully_qualified_type_name_impl<T>();
^
"/home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/include/c10/util/TypeIndex.h", line 95: note: expression cannot be interpreted
? (throw std::logic_error("Invalid pattern"), string_view())
^
"/home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/include/c10/util/TypeIndex.h", line 122: note: called from:
return extract(
^
detected during:
instantiation of "c10::string_view c10::util::get_fully_qualified_type_name<T>() noexcept [with T=std::string]" at line 561 of "/home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/include/c10/util/typeid.h"
instantiation of "uint16_t caffe2::TypeMeta::addTypeMetaData<T>() [with T=std::string]" at line 686 of "/home/user/miniconda3/envs/torchfort/lib/python3.11/site-packages/torch/include/c10/util/typeid.h"
There is more of the same output; I have just listed the first 50 lines. Do you have any suggestions?
I was looking over our builds and we use the GNU compiler for the C++ files. It appears that nvc++ does not support the __rdtsc() intrinsic, which is where this error is coming from.
Can you try adding the flag -DCMAKE_CXX_COMPILER=g++ to your CMake build line to use the GNU compiler for the C++ files?
If it works, you should see CMake report a line similar to:
-- The CXX compiler identification is GNU 9.4.0
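A minimal sketch of what the mixed-compiler configure line could look like. Only -DCMAKE_CXX_COMPILER=g++ comes from the suggestion above; pinning the Fortran compiler with -DCMAKE_Fortran_COMPILER=nvfortran and running from a build directory one level below the source tree (the trailing `..`) are assumptions. The command is printed rather than executed so it can be reviewed first.

```shell
# Compiler choices: GNU for C++ (from the thread), NVHPC for Fortran (assumed).
cxx_compiler=g++
fortran_compiler=nvfortran

# Report where each compiler resolves on PATH before configuring.
for tool in "$cxx_compiler" "$fortran_compiler"; do
  if tool_path=$(command -v "$tool"); then
    echo "$tool -> $tool_path"
  else
    echo "$tool -> not found on PATH"
  fi
done

# Assemble the configure line and print it instead of running it.
configure_cmd="cmake -DCMAKE_CXX_COMPILER=$cxx_compiler -DCMAKE_Fortran_COMPILER=$fortran_compiler .."
echo "$configure_cmd"
```

Because CMake caches the compiler choice, starting from an empty build directory is usually the safest way to switch compilers.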
Thanks, I managed to get a bit further by using -DCMAKE_CXX_COMPILER=g++ as suggested.
I then hit issue #6, but I have resolved that and submitted PR #5.
However, when I run make I still get errors. I am worried it may have something to do with mixing compiler versions (nvfortran and g++). Have you had this issue?
$ make
[ 2%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/distributed.cpp.o
[ 5%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/logging.cpp.o
[ 8%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/model_state.cpp.o
[ 10%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/model_wrapper.cpp.o
[ 13%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/model_pack.cpp.o
[ 16%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/param_map.cpp.o
[ 18%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/setup.cpp.o
[ 21%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/torchfort.cpp.o
[ 24%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/utils.cpp.o
[ 27%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/losses/l1_loss.cpp.o
[ 29%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/losses/mse_loss.cpp.o
[ 32%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/lr_schedulers/cosine_annealing_lr.cpp.o
[ 35%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/lr_schedulers/multistep_lr.cpp.o
[ 37%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/lr_schedulers/polynomial_lr.cpp.o
[ 40%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/lr_schedulers/scheduler_setup.cpp.o
[ 43%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/lr_schedulers/step_lr.cpp.o
[ 45%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/models/mlp_model.cpp.o
[ 48%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/rl/rl.cpp.o
[ 51%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/rl/utils.cpp.o
[ 54%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/rl/ddpg.cpp.o
[ 56%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/rl/td3.cpp.o
[ 59%] Building CXX object CMakeFiles/torchfort.dir/src/csrc/rl/sac.cpp.o
[ 62%] Linking CXX shared library lib/libtorchfort.so
[ 62%] Built target torchfort
[ 64%] Building Fortran object CMakeFiles/torchfort_fort.dir/src/fsrc/torchfort_m.F90.o
[ 67%] Linking Fortran shared library lib/libtorchfort_fort.so
[ 67%] Built target torchfort_fort
[ 70%] Building Fortran object examples/fortran/simulation/CMakeFiles/train.dir/simulation.f90.o
NVFORTRAN-F-0004-Unable to open MODULE file hdf5.mod (/home/user/sync/projects/side/TorchFort/examples/fortran/simulation/simulation.f90: 119)
NVFORTRAN/x86-64 Linux 23.7-0: compilation aborted
make[2]: *** [examples/fortran/simulation/CMakeFiles/train.dir/build.make:88: examples/fortran/simulation/CMakeFiles/train.dir/simulation.f90.o] Error 2
make[1]: *** [CMakeFiles/Makefile2:179: examples/fortran/simulation/CMakeFiles/train.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
This looks like the compilation is failing to find hdf5.mod, which should be in your installed HDF5 include directory. Can you check that your HDF5 include directory has this module installed? Running make VERBOSE=1 should show the compilation line, so you can see whether the right include directory with that module is being added to it.
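One way to run that check, as a sketch: search the HDF5 install prefix for hdf5.mod and compare the directory it reports with the -I paths printed by make VERBOSE=1. The scratch tree and its include/shared subdirectory below are hypothetical stand-ins so the snippet runs anywhere; on a real system hdf5_prefix would point at the actual HDF5 install prefix (for a spack build, `spack location -i hdf5` should print it).

```shell
# Scratch tree standing in for an HDF5 prefix (layout is hypothetical).
hdf5_prefix=$(mktemp -d)
mkdir -p "$hdf5_prefix/include/shared"
touch "$hdf5_prefix/include/shared/hdf5.mod"

# List every directory under the prefix that contains hdf5.mod.
mod_dirs=$(find "$hdf5_prefix" -name hdf5.mod -exec dirname {} \; | sort)
echo "$mod_dirs"
```

If none of the reported directories appears after a -I flag on the nvfortran command line, the compiler has no way to find the module.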
I forgot that spack puts the hdf5.mod files in a different location (pathtolib/static/ and pathtolib/shared/ instead of just pathtolib/). I have moved the shared libs to the main folder and now it builds.
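A sketch of that workaround on a scratch directory. It assumes "the shared libs" refers to the .mod files under the shared/ subdirectory, and it copies them instead of moving them so the original layout stays intact; mod_root is a hypothetical stand-in for the pathtolib directory mentioned above.

```shell
# Scratch stand-in for the directory that contains shared/ (hypothetical layout).
mod_root=$(mktemp -d)
mkdir -p "$mod_root/shared"
touch "$mod_root/shared/hdf5.mod"

# Copy the module files one level up, where the build looks for them.
cp "$mod_root/shared/"*.mod "$mod_root/"

# Show the parent directory; hdf5.mod should now sit next to shared/.
ls "$mod_root"
```

Copying leaves the spack-installed tree unchanged apart from the added files, which makes the change easier to undo than a move.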
I will now close this issue. Thanks for your help.
I am trying to install TorchFort dependencies with spack and then build with cmake.
So far I have installed the following dependencies (with spack and using gcc version 12.3.0):
I have also set up and configured a conda environment which contains python 3.11.4, and I pip installed pybind11 2.11.1 and the requirements.txt file using pip install -r requirements.txt from within the conda environment.
I used the following bash script to compile my code:
I get the following error:
For some reason cmake can find the MPI_CXX but not MPI_Fortran. Do you have any ideas how to get this working?
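A quick diagnostic for this symptom, offered as a sketch and not as something run in the thread: CMake's MPI detection typically relies on the MPI wrapper compilers, so it can help to see which ones resolve on PATH and from which installation. The wrapper names below are the usual Open MPI ones; that they are the relevant ones on this system is an assumption.

```shell
# Report where each MPI wrapper resolves on PATH (or that it is missing).
report=$(
  for wrapper in mpicxx mpifort mpif90; do
    if wrapper_path=$(command -v "$wrapper"); then
      echo "$wrapper -> $wrapper_path"
    else
      echo "$wrapper -> not found on PATH"
    fi
  done
)
echo "$report"
```

A C++ wrapper that resolves to one installation while the Fortran wrapper is missing or resolves to another would be consistent with MPI_CXX being found and MPI_Fortran not.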