T9-GPU does not compile

gwiecz1 commented 2 years ago

Hello, I tried to compile Tinker9 with GPU support today, with both Intel (included in the oneAPI compiler 2022.1) and GNU compilers (Debian 10.3.0-15, Debian 11.3.0-3), and Nvidia hpc_sdk 22.5. The build failed in 2 places, depending on the configuration options. Example cmake commands I issued:

cmake .. -DCMAKE_Fortran_COMPILER_ID=GNU -DCUDA_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda -DCMAKE_CUDA_ARCHITECTURES="60;70" -DCMAKE_VERBOSE_MAKEFILE=ON
cmake .. -DCMAKE_Fortran_COMPILER_ID=Intel -DCUDA_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda -DCMAKE_CUDA_ARCHITECTURES="60;70" -DCMAKE_VERBOSE_MAKEFILE=ON

Both the Intel and Nvidia toolchains are installed and checked. Attached is the last stage of the build, which failed during linking. I don't know how to properly submit such cases; I apologize for the inconvenience and humbly ask for advice. Best Regards, Grzegorz

intel_fail.txt.gz

zhi-wang commented 2 years ago

Hi Grzegorz! Here are my observations:

I am unfamiliar with the flag -DCMAKE_Fortran_COMPILER_ID. I understand what it stands for, and I've seen it in the past, but I'm not sure it is the correct way to set the Fortran compiler in CMake. If it works, great! These are the two standard methods provided by CMake:
- Set your GNU Fortran compiler on the command line: cmake -DCMAKE_Fortran_COMPILER=gfortran ...
- Set your GNU Fortran compiler in the environment variable FC, as in FC=gfortran cmake ..., or export FC=gfortran before you run cmake.
Replace gfortran with ifort (or whatever you use) where appropriate.
I assume you are using a recent CMake version, which has adopted better support for nvcc and, more significantly, for nvhpc. But the reality is that Tinker9 has to deal with a lot of old environments, so we cannot use the new flags provided by CMake. It is not clear what the consequences would be if you used these new flags. Your attached linking error seems irrelevant to them. There is a chance that it'll be fine in the end, but it won't surprise me if you run into strange runtime errors even if the compilation looked fine. The flags provided by Tinker9 are:
- CMAKE_CUDA_ARCHITECTURES needs to be replaced by -DCOMPUTE_CAPABILITY=comma,separated
- CUDA_DIR must be /usr/local/cuda or a canonical CUDA (one that is not from nvhpc) installed somewhere else, due to compatibility issues with older CMake. Yes, users with older CMake have to install a separate CUDA.
- More "custom flags" are explained here https://github.com/TinkerTools/tinker9/blob/master/doc/manual/m/install/buildwithcmake.rst
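To illustrate the flag replacement described above, here is a minimal C++ sketch that rewrites a CMake-style architecture list such as "60;70" into the comma-separated form that -DCOMPUTE_CAPABILITY expects. The helper name toComputeCapabilityFlag is hypothetical and is not part of Tinker9 or CMake; it only shows the string transformation.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of Tinker9 or CMake): convert a
// semicolon-separated list like "60;70" into the comma-separated
// value used with -DCOMPUTE_CAPABILITY.
std::string toComputeCapabilityFlag(const std::string& cmakeArchs) {
    assert(!cmakeArchs.empty());  // an empty list makes no sense here
    std::string list = cmakeArchs;
    for (char& c : list) {
        if (c == ';') {
            c = ',';
        }
    }
    return "-DCOMPUTE_CAPABILITY=" + list;
}
```

Calling toComputeCapabilityFlag("60;70") yields the string -DCOMPUTE_CAPABILITY=60,70, which can be pasted into the cmake command line in place of -DCMAKE_CUDA_ARCHITECTURES="60;70".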

As for your attached linking error, I have doubts about which Fortran compiler you actually used. It might not be Intel, unless Intel has changed something they had left untouched for decades. Assuming my concerns in the previous bullet points are unnecessary, I'd like to see the output of the following command (as we don't have an Intel compiler now):

# under the build directory, we can cd to...
cd build/ext/interface/CMakeFiles/tinkerObjF.dir/__/source
# there are a lot of .o files, e.g., atoms.f.o. What do you get with nm -n atoms.f.o?
nm -n atoms.f.o
00... B __atoms_MOD_x
00... B __atoms_MOD_y
00... B __atoms_MOD_z
00... B __atoms_MOD_n
00... B __atoms_MOD_type
# These are the objects compiled by gfortran.
# Do you see atoms_mp_n_, atoms_mp_x_, etc. with your intel build?
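
The check above relies on the two compilers naming module variables differently: gfortran emits symbols of the form __module_MOD_name, while the Intel compiler is expected to emit module_mp_name_. As a rough sketch, the distinction can be expressed in C++ as below. The names FortranMangling and classifySymbol are made up for illustration, and the patterns are simple heuristics based only on the symbol shapes quoted in this thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of a Fortran module-variable symbol.
enum class FortranMangling { Gfortran, Intel, Unknown };

// Heuristic only: gfortran uses "__<module>_MOD_<name>",
// Intel uses "<module>_mp_<name>_".
FortranMangling classifySymbol(const std::string& sym) {
    const bool startsWithDoubleUnderscore = sym.rfind("__", 0) == 0;
    if (startsWithDoubleUnderscore && sym.find("_MOD_") != std::string::npos) {
        return FortranMangling::Gfortran;
    }
    if (sym.find("_mp_") != std::string::npos && sym.back() == '_') {
        return FortranMangling::Intel;
    }
    return FortranMangling::Unknown;
}

// Sanity check against the symbol names mentioned in the thread.
void checkThreadExamples() {
    assert(classifySymbol("__atoms_MOD_x") == FortranMangling::Gfortran);
    assert(classifySymbol("atoms_mp_x_") == FortranMangling::Intel);
}
```

Feeding each name from the third column of the nm -n output to classifySymbol gives a quick indication of which compiler produced the object file.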

gwiecz1 commented 2 years ago

Zhi Wang, Thank you so much for your help! Setting FC=gfortran (or ifort) and CXX=g++ (or icpc) and setting CMAKE_Fortran_COMPILER accordingly did the trick. I compiled Tinker9 with gcc versions 9, 10 and 11 and icpc (ICC) 2021.6.0 20220226. However, "make all" finished with an error while compiling the tests, with all the compilers:

[ 94%] Building CXX object test/CMakeFiles/__t9_all_tests_o.dir/async.cpp.o
"/home/gigo/tinker/tinker9_01/test/async.cpp", line 42: error: namespace "std::this_thread" has no member "sleep_for"
this_thread::sleep_for(milliseconds(dup_ms));

Do you want me to investigate the issue?

% cd ext/interface/CMakeFiles/tinkerObjF.dir/_/source
% nm -n atoms.f.o
0000000000000000 T atoms.
0000000000000004 C atoms_mpn
00000000003d0900 C atoms_mptype
00000000007a1200 C atoms_mpx
00000000007a1200 C atoms_mpy
00000000007a1200 C atoms_mpz

The cmake I use is 3.23.2 (debian sid).

Thank you very much! Grzegorz

zhi-wang commented 2 years ago

I've never seen this problem. This error says that the C++ compiler cannot find the member sleep_for in the standard namespace std::this_thread, which should be declared in the standard C++ header <thread>. This means something is different in your C++ compiler and/or your system toolchain, as I've never seen it in my g++/Ubuntu environment. Since I don't have your environment to test my hypothesis, I hope you can help me here.

My hypothesis is that your system has a different set of C++ header files than the one in Ubuntu. In the file test/async.cpp, the header <thread> is not explicitly included. This is not a problem in Ubuntu. Could you try including this header in the source file and recompiling to see whether the error goes away? If you still see the problem, I'd like a more verbose error message to investigate (make VERBOSE=1).
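
For reference, a minimal sketch of what the explicit include looks like in isolation. This is not the code from test/async.cpp; the function name sleepAndMeasureMs is hypothetical and only demonstrates that std::this_thread::sleep_for is available once <thread> (and <chrono> for the duration type) are included directly rather than relied on transitively.

```cpp
#include <cassert>
#include <chrono>
#include <thread>  // declares std::this_thread::sleep_for; included explicitly

// Hypothetical helper (not part of Tinker9): sleep for `ms` milliseconds
// and return the elapsed time in milliseconds, measured with a monotonic clock.
long long sleepAndMeasureMs(int ms) {
    assert(ms >= 0);
    using namespace std::chrono;
    const auto start = steady_clock::now();
    std::this_thread::sleep_for(milliseconds(ms));
    const auto stop = steady_clock::now();
    return duration_cast<milliseconds>(stop - start).count();
}
```

If a file like this compiles with the same compiler and flags that fail on test/async.cpp, the missing include is the likely cause rather than a broken standard library.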

Thanks!

gwiecz1 commented 2 years ago

Hi, My joy was slightly premature. I have 3 CUDA installations on the test machine now:

- CUDA which came with the nvhpc compiler. Tinker9 compiles, but gives a runtime error saying that it was not compiled with the proper toolchain.
- CUDA from the official Debian package. T9 does not compile: undefined reference to '__nv_sqrtf' and more __nv functions during linking.
- CUDA from the official Nvidia installer. T9 compiles. When run it says: Terminating with uncaught exception : merge_sort: failed on 2nd step: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device. I did not expect much more, since the Nvidia driver (from the Debian package) does not meet the minimum driver version requirement of the CUDA installer.

I'm kind of stuck now. I can't afford to break the package update mechanism by installing a separate CUDA (and the Nvidia driver) on all my machines, and then maintaining the resulting mess. The only way I see is to overcome the compilation problems with the Debian CUDA (which I already use with a bunch of software). Thank you for your help once more. For now T9 is a "no go" for me; I will let you know of any developments. Best Regards, Grzegorz

gwiecz1 commented 2 years ago

Update: T9 with CUDA from the official Nvidia installer works despite the driver version discrepancy. I just did not provide the proper compute capabilities previously. I will now come back to the test compilation errors, since I would like to make sure that it produces valuable output :) g

gwiecz1 commented 2 years ago

Hi Zhi Wang, As you suggested, explicit inclusion of <thread> in test/async.cpp fixes the compilation problems of tests on my system. All tests passed (83087 assertions in 67 test cases) Regards, g

zhi-wang commented 2 years ago

Cool! I was going to ask you how cmake was configured. I'm glad you figured it out. Would you please share with us the details of your toolchains, including the versions of your OS and C++, Fortran compilers, so I can better document Tinker9? Thanks!

gwiecz1 commented 2 years ago

The testing machine is:
CPU: 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz
GPU: NVIDIA RTX A3000 (max cap. 8.6)
Debian sid 5.17.11-1 (2022-05-26) x86_64 GNU/Linux
Compilers:
- gcc (Debian 11.3.0-3) 11.3.0 - for gcc compilation
- icpc and ifort 2021.6.0 20220226 - for intel compilation
- nvc++ (nvhpc) 22.5.0
- nvcc 11.7.64

Tinker9 1.0.0 GIT 18f90f19 (+1 line :))
Tinker GIT 3dc966e2
nvidia driver 470.129.06
CUDA 11.7.0
cmake 3.23.2

And now ... back to undefined references with debian cuda :) Thanks! g

gwiecz1 commented 2 years ago

Hi Zhi Wang, What are the oldest versions of cmake and cuda toolkit that you want to support? Best, g

zhi-wang commented 2 years ago

Hi, we found that, for the CMake features currently used, the minimum version is 3.15. We tested the code with CUDA 9.1. I would naively assume 9.0 would also work, and we don't plan to support CUDA 8 or older.

gwiecz1 commented 2 years ago

OK, I will work on it.

gwiecz1 commented 2 years ago

Hello, The problem was the lack of compatibility between the Nvidia HPC SDK and the nvcc/CUDA in Debian sid. That is all. Besides, Tinker9 compiles and works well with the CUDA installation from nvhpc! I could remove the extraneous CUDA from /usr/local. I think that using CUDA (and nvcc) from nvhpc may be advisable. Since they are distributed together, they should be compatible, while nvhpc and CUDA from the CUDA-only installer may fall into the same trap as the nvhpc/Debian-CUDA combination above if they are updated asynchronously.

cmake .. -DCMAKE_Fortran_COMPILER=gfortran -DCOMPUTE_CAPABILITY=86 -DCUDA_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda

works like a charm. I think that this concludes this thread. Thank you, g

TinkerTools / tinker9

T9-GPU does not compile #199