brucefan1983 / GPUMD

Graphics Processing Units Molecular Dynamics
https://gpumd.org/dev
GNU General Public License v3.0
439 stars · 113 forks

nep executable error - "no kernel image is available for execution on the device" #576

Closed antoni-2 closed 2 months ago

antoni-2 commented 5 months ago

Hello,

I would like to report an issue I found using GPUMD version 3.9.1.

I was trying to train a first test neuroevolution potential using the "nep" executable on the cluster I use. After preparing the input files (nep.in, test.xyz, and train.xyz) and running the "nep" command, GPUMD prints:

number of GPUs = 1

Device id: 0

Device name:             Tesla K80

Compute capability:      3.7

Amount of global memory: 11.1731 GB

Number of SMs:           13

Then the nep.in file is read successfully. Later:


Started reading train.xyz.


Number of configurations = 20.

Number of devices = 1

Number of batches = 1

(Note: I changed batch_size from 1000 to 20 in nep.in.)

Batch 0:

Number of configurations = 20.


Constructing train_set in device 0.

Total number of atoms = 1000.

Number of atoms in the largest configuration = 50.

Number of configurations having virial = 0.

CUDA Error:

File:       main_nep/dataset.cu

Line:       266

Error code: 209

Error text: no kernel image is available for execution on the device

With the help of the cluster admins, we found that the error is raised by the macro CUDA_CHECK_KERNEL, defined in utilities/error.cuh as follows (the excerpt starts inside an #if/#else block):

#define CUDA_CHECK_KERNEL                                                                          \
  {                                                                                                \
    CHECK(cudaGetLastError());                                                                     \
    CHECK(cudaDeviceSynchronize());                                                                \
  }
#else
#define CUDA_CHECK_KERNEL                                                                          \
  {                                                                                                \
    CHECK(cudaGetLastError());                                                                     \
  }
#endif
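
For context, this is roughly how such a macro surfaces an architecture mismatch: a binary built without code for the running GPU fails at the first kernel launch, and the error is only reported by the next cudaGetLastError() or cudaDeviceSynchronize() call, which is why it shows up at a CUDA_CHECK_KERNEL line in dataset.cu rather than at startup. A minimal sketch (not GPUMD code; the kernel and error handling here are illustrative):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal stand-in for GPUMD's CHECK / CUDA_CHECK_KERNEL pattern.
#define CHECK(call)                                                    \
  do {                                                                 \
    const cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                          \
      printf("CUDA Error %d: %s\n", err, cudaGetErrorString(err));     \
      exit(1);                                                         \
    }                                                                  \
  } while (0)

__global__ void dummy_kernel() {}

int main()
{
  dummy_kernel<<<1, 1>>>();
  // If the executable contains no SASS or PTX usable on this GPU,
  // the launch fails asynchronously, and error 209 ("no kernel image
  // is available for execution on the device") is reported here,
  // not at the launch statement itself.
  CHECK(cudaGetLastError());
  CHECK(cudaDeviceSynchronize());
  return 0;
}
```
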

The function we think is causing the error is cudaDeviceSynchronize(). However, this call works when we run it outside GPUMD.

Cluster configuration: driver version 470.129.06, CUDA version 11.4, GPU: Tesla K80. Compiling with nvcc from NVHPC 23.3 and CUDA 11.8 gave the same result.

I do not know how to solve this issue. I would be very grateful for your help!

Kind regards, Antoni

brucefan1983 commented 5 months ago

You can try changing CFLAGS = -std=c++14 -O3 -arch=sm_60 to CFLAGS = -std=c++14 -O3 -arch=sm_37 in src/makefile and then recompile (make clean followed by make).

antoni-2 commented 5 months ago

Thank you for your answer. Unfortunately, the error persists. Below is the makefile used for compilation. Different combinations of CFLAGS were tried:

CFLAGS = -std=c++11 -O3 -arch=sm_37
CFLAGS = -std=c++14 -O3 -arch=sm_37

… as well as compilation with and without PLUMED and NetCDF, giving the same effect.

Moreover, the error message remains the same while using the input files (nep.in, train.xyz, and test.xyz) from the repository (GPUMD/examples/11_NEP_potential_PbTe/).

makefile_gpumd_issue_28mar2024.txt

brucefan1983 commented 5 months ago

Then I guess CUDA code does not work on your platform at all. You can try to compile and run the following minimal CUDA program:

#include <stdio.h>
__global__ void hello_from_gpu()
{
    printf("Hello World from the GPU!\n");
}

int main(void)
{
    hello_from_gpu<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Save the above code into file hello.cu and compile it using nvcc -arch=sm_37 hello.cu -o hello and then run the executable ./hello. If it is successful, you will see the message Hello World from the GPU!.
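
If the plain -arch=sm_37 build keeps failing inside GPUMD, one thing worth trying (an assumption on my part, not something confirmed in this thread) is embedding both SASS and PTX for compute capability 3.7, so the driver can JIT-compile the PTX at load time if the embedded SASS is not usable:

```shell
# Embed a cubin (SASS) for sm_37 plus PTX for compute_37; the PTX
# lets the driver JIT a kernel image for the device if needed.
nvcc -std=c++14 -O3 \
     -gencode arch=compute_37,code=sm_37 \
     -gencode arch=compute_37,code=compute_37 \
     hello.cu -o hello
```

The same -gencode flags could be added to CFLAGS in src/makefile for the GPUMD build.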

antoni-2 commented 5 months ago

After creating the hello.cu file, here is the output of the commands: nvcc -arch=sm_37 hello.cu -o hello

nvcc warning : The 'compute_35', 'compute_37', 'sm_35', and 'sm_37' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

./hello

Hello World from the GPU!

It seems that cudaDeviceSynchronize() works correctly outside GPUMD on the cluster I use. Unfortunately, I do not know why the call works outside GPUMD but fails within it. Do you have an idea?

brucefan1983 commented 5 months ago

Being able to compile and run the simplest CUDA code means you have a working CUDA platform.

Did you run gpumd (or nep) directly from the command line? The error

Error text: no kernel image is available for execution on the device

means that your executable was not compiled for your GPU architecture. However, you showed that you compiled with -arch=sm_37, which matches the K80 GPU you mentioned. So I am really puzzled.
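
One way to check which architectures an executable actually embeds is the CUDA toolkit's cuobjdump utility (assuming it is on the cluster's PATH; ./nep below stands for the path to the failing binary):

```shell
# List the SASS (ELF) and PTX images embedded in the binary; the
# output should mention sm_37 / compute_37 if the K80 is targeted.
cuobjdump --list-elf ./nep
cuobjdump --list-ptx ./nep
```

If the listing shows no sm_37 or compute_37 entries despite compiling with -arch=sm_37, stale object files from a previous build may have been linked in, which a fresh make clean would rule out.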

antoni-2 commented 5 months ago

The error log I reported at the beginning of this issue was produced by running the "nep" command directly from the command line, in the directory containing the input files (nep.in, train.xyz, and test.xyz). I have not used the "gpumd" command yet.

brucefan1983 commented 5 months ago

If possible, could you test on a different platform?

tamaswells commented 5 months ago

I encountered a similar problem before. Changing -arch=sm_XX to a smaller number solved it for me.

antoni-2 commented 5 months ago

Thanks for the tip. Unfortunately, in my case, compiling with a lower number in -arch=sm_XX gave the same result. The tested options were:

-arch=sm_35 (the lowest value accepted by either of the two nvcc versions available on the cluster I use)
-arch=native (presumably equivalent to sm_37; this was also tried with different nvcc versions)

brucefan1983 commented 2 months ago

I would like to close this if there is no more discussion. I believe this is a problem with the CUDA environment rather than with GPUMD.