TinkerTools / tinker9

Tinker9: Next Generation of Tinker with GPU Support

T9-GPU does not compile #199

Closed gwiecz1 closed 2 years ago

gwiecz1 commented 2 years ago

Hello, I tried to compile Tinker9 with GPU support today, using both the Intel compilers (from the oneAPI 2022.1 release) and the GNU compilers (Debian 10.3.0-15 and Debian 11.3.0-3), together with Nvidia hpc_sdk 22.5. The build failed in two places, depending on the configuration options. Example cmake commands I used:

cmake .. -DCMAKE_Fortran_COMPILER_ID=GNU -DCUDA_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda -DCMAKE_CUDA_ARCHITECTURES="60;70" -DCMAKE_VERBOSE_MAKEFILE=ON
cmake .. -DCMAKE_Fortran_COMPILER_ID=Intel -DCUDA_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda -DCMAKE_CUDA_ARCHITECTURES="60;70" -DCMAKE_VERBOSE_MAKEFILE=ON

Both the Intel and Nvidia toolchains are installed and checked. Attached is the last stage of the build, which failed during linking. I don't know how to submit such cases properly; I apologize for the inconvenience and humbly ask for advice. Best Regards, Grzegorz
Attachment: intel_fail.txt.gz

zhi-wang commented 2 years ago

Hi Grzegorz! Here are my observations:

gwiecz1 commented 2 years ago

Zhi Wang, Thank you so much for your help! Setting FC=gfortran (or ifort) and CXX=g++ (or icpc), and setting CMAKE_Fortran_COMPILER accordingly, did the trick. I compiled Tinker9 with gcc versions 9, 10 and 11 and with icpc (ICC) 2021.6.0 20220226. However, "make all" finished with an error while compiling the tests, with all of these compilers:

[ 94%] Building CXX object test/CMakeFiles/__t9_all_tests_o.dir/async.cpp.o
"/home/gigo/tinker/tinker9_01/test/async.cpp", line 42: error: namespace "std::this_thread" has no member "sleep_for"
this_thread::sleep_for(milliseconds(dup_ms));

Do you want me to investigate the issue?

% cd ext/interface/CMakeFiles/tinkerObjF.dir/_/source
% nm -n atoms.f.o
0000000000000000 T atoms.
0000000000000004 C atoms_mpn
00000000003d0900 C atoms_mptype
00000000007a1200 C atoms_mpx
00000000007a1200 C atoms_mpy
00000000007a1200 C atoms_mpz

The cmake I use is 3.23.2 (debian sid).

Thank you very much! Grzegorz

zhi-wang commented 2 years ago

I've never seen this problem with my g++/Ubuntu environment. The error says that the C++ compiler cannot find the member sleep_for in the standard namespace std::this_thread, which is declared in the standard C++ header <thread>. This suggests that something differs in your C++ compiler and/or your system toolchain. Since I don't have your environment to test my hypothesis, I hope you can help me here.

My hypothesis is that your system has a different set of C++ header files from the one in Ubuntu. The file test/async.cpp does not explicitly include the header <thread>. This is not a problem in Ubuntu. Could you try including this header in that source file and recompiling to see whether the error goes away? If you still see the problem, I'd like a more verbose error message to investigate (make VERBOSE=1).

Thanks!

gwiecz1 commented 2 years ago

Hi, My joy was slightly premature. I have 3 CUDA installations on the test machine now:

  1. CUDA that came with the nvhpc compiler. Tinker9 compiles, but gives a runtime error saying that it was not compiled with the proper toolchain.
  2. CUDA from the official Debian package. T9 does not compile: undefined reference to '__nv_sqrtf' and more __nv functions during linking.
  3. CUDA from the official Nvidia installer. T9 compiles. When run, it says: Terminating with uncaught exception : merge_sort: failed on 2nd step: cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device. I did not expect much more, since the Nvidia driver (from the Debian package) does not meet the minimum driver version required by the CUDA installer.

I'm kind of stuck now. I can't afford to break the package update mechanism by installing a separate CUDA (and Nvidia driver) on all my machines and then maintaining the resulting mess. The only way forward I see is to overcome the compilation problems with the Debian CUDA (which I already use with a bunch of software). Thank you for your help once more. For now T9 is a "no go" for me; I will let you know of any developments. Best Regards, Grzegorz

gwiecz1 commented 2 years ago

Update: T9 with CUDA from the official Nvidia installer works despite the driver version discrepancy. I simply had not provided the proper compute capabilities previously. I will now come back to the test compilation errors, since I would like to make sure that it produces valid output :) g
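For readers hitting the same cudaErrorNoKernelImageForDevice message: the GPU in this thread reports a maximum capability of 8.6, and the cmake command that finally works passes -DCOMPUTE_CAPABILITY=86. A minimal sketch of that mapping follows, assuming the flag value is simply the major and minor digits written together. The helper name capability_flag is hypothetical and is not part of Tinker9 or CUDA.

```cpp
#include <string>

// Hypothetical helper (not a Tinker9 or CUDA API): writes a compute
// capability given as major.minor in the digits-only form used on the
// cmake command line in this thread, e.g. 8.6 becomes "86".
std::string capability_flag(int major, int minor) {
    return std::to_string(major) + std::to_string(minor);
}
```

With this assumption, capability_flag(8, 6) yields the "86" used for the RTX A3000 here, while the earlier configuration only listed "60;70".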

gwiecz1 commented 2 years ago

Hi Zhi Wang, As you suggested, explicit inclusion of <thread> in test/async.cpp fixes the compilation problems of tests on my system. All tests passed (83087 assertions in 67 test cases) Regards, g
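The fix confirmed above amounts to including <thread> directly instead of relying on another header to pull it in. A minimal sketch of the pattern follows; it is not the actual contents of test/async.cpp. The function name pause_for_ms is hypothetical, and dup_ms is borrowed from the error message purely as a parameter name.

```cpp
#include <chrono>  // std::chrono::milliseconds, std::chrono::steady_clock
#include <thread>  // std::this_thread::sleep_for, included explicitly

// Hypothetical helper: sleeps for dup_ms milliseconds and returns the
// elapsed wall time in milliseconds, measured with a monotonic clock.
long long pause_for_ms(int dup_ms) {
    using namespace std::chrono;
    const auto start = steady_clock::now();
    std::this_thread::sleep_for(milliseconds(dup_ms));
    return duration_cast<milliseconds>(steady_clock::now() - start).count();
}
```

Whether <thread> arrives transitively through other standard headers depends on the standard library implementation, so including it wherever std::this_thread is used keeps the code portable across toolchains.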

zhi-wang commented 2 years ago

Cool! I was going to ask you how cmake was configured. I'm glad you figured it out. Would you please share with us the details of your toolchains, including the versions of your OS and C++, Fortran compilers, so I can better document Tinker9? Thanks!

gwiecz1 commented 2 years ago

The testing machine is:
CPU: 11th Gen Intel(R) Core(TM) i9-11950H @ 2.60GHz
GPU: NVIDIA RTX A3000 (max cap. 8.6)
Debian sid 5.17.11-1 (2022-05-26) x86_64 GNU/Linux
Compilers:
gcc (Debian 11.3.0-3) 11.3.0 - for gcc compilation
icpc and ifort 2021.6.0 20220226 - for intel compilation
nvc++ (nvhpc) 22.5.0
nvcc 11.7.64

Tinker9 1.0.0 GIT 18f90f19 (+1 line :))
Tinker GIT 3dc966e2
nvidia driver 470.129.06
CUDA 11.7.0
cmake 3.23.2

And now ... back to undefined references with debian cuda :) Thanks! g

gwiecz1 commented 2 years ago

Hi Zhi Wang, What are the oldest versions of cmake and cuda toolkit that you want to support? Best, g

zhi-wang commented 2 years ago

Hi, we found that, for the CMake features currently in use, the minimum version is 3.15. We tested the code with CUDA 9.1. I would naively assume 9.0 also works, and we don't plan to support CUDA 8 or older.

gwiecz1 commented 2 years ago

OK, I will work on it.

gwiecz1 commented 2 years ago

Hello, The problem was a lack of compatibility between the Nvidia hpc sdk and the nvcc/CUDA in Debian sid. That is all. Besides, Tinker9 compiles and works well with the CUDA installation from nvhpc! I could remove the extraneous CUDA from /usr/local. I think that using CUDA (and nvcc) from nvhpc may be advisable. Since they are distributed together, they should be compatible, while nvhpc and CUDA from the CUDA-only installer may fall into the same trap as the nvhpc/Debian-CUDA combination above if they are updated asynchronously.

cmake .. -DCMAKE_Fortran_COMPILER=gfortran -DCOMPUTE_CAPABILITY=86 -DCUDA_DIR=/opt/nvidia/hpc_sdk/Linux_x86_64/22.5/cuda

works like a charm. I think that this concludes this thread. Thank you, g