Build failed on Jetson AGX Orin (Error generating file: build/CMakeFiles/ctranslate2.dir/src/ops/flash-attention/./ctranslate2_generated_flash_fwd_split_hdim96_fp16_sm80.cu.o)

I got the following error message when running "make -j10":

CMake Error at ctranslate2_generated_flash_fwd_split_hdim96_fp16_sm80.cu.o.Release.cmake:280 (message):
  Error generating file
  /workspace/workbench/ctranslate2/build/CMakeFiles/ctranslate2.dir/src/ops/flash-attention/./ctranslate2_generated_flash_fwd_split_hdim96_fp16_sm80.cu.o

Here is the full log:

# make -j10
[  1%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/flash-attention/ctranslate2_generated_flash_fwd_split_hdim96_fp16_sm80.cu.o
[  1%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/cuda/ctranslate2_generated_primitives.cu.o
[  2%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/cuda/ctranslate2_generated_random.cu.o
[  2%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_alibi_add_gpu.cu.o
[  3%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_bias_add_gpu.cu.o
[  3%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_concat_split_slide_gpu.cu.o
[  4%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_conv1d_gpu.cu.o
[  4%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_flash_attention_gpu.cu.o
[  5%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_dequantize_gpu.cu.o
[  6%] Building NVCC (Device) object CMakeFiles/ctranslate2.dir/src/ops/ctranslate2_generated_gather_gpu.cu.o
/workspace/workbench/ctranslate2/include/ctranslate2/ops/flash-attention/flash_fwd_launch_template.h(15): warning: attribute "__global__" does not apply here

/workspace/workbench/ctranslate2/include/ctranslate2/ops/flash-attention/flash_fwd_launch_template.h(15): error: incomplete type is not allowed

/workspace/workbench/ctranslate2/include/ctranslate2/ops/flash-attention/flash_fwd_launch_template.h(15): error: identifier "__grid_constant__" is undefined

/workspace/workbench/ctranslate2/include/ctranslate2/ops/flash-attention/flash_fwd_launch_template.h(15): error: expected a ")"

/workspace/workbench/ctranslate2/include/ctranslate2/ops/flash-attention/flash_fwd_launch_template.h(15): error: expected a ";"

4 errors detected in the compilation of "/workspace/workbench/ctranslate2/src/ops/flash-attention/flash_fwd_split_hdim96_fp16_sm80.cu".
CMake Error at ctranslate2_generated_flash_fwd_split_hdim96_fp16_sm80.cu.o.Release.cmake:280 (message):
  Error generating file
  /workspace/workbench/ctranslate2/build/CMakeFiles/ctranslate2.dir/src/ops/flash-attention/./ctranslate2_generated_flash_fwd_split_hdim96_fp16_sm80.cu.o

make[2]: *** [CMakeFiles/ctranslate2.dir/build.make:371: CMakeFiles/ctranslate2.dir/src/ops/flash-attention/ctranslate2_generated_flash_fwd_split_hdim96_fp16_sm80.cu.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/workspace/workbench/ctranslate2/src/ops/flash_attention_gpu.cu: In function 'void ctranslate2::ops::set_params_splitkv(Flash_fwd_params&, int, int, int, int, int, int, int, cudaDeviceProp*)':
/workspace/workbench/ctranslate2/src/ops/flash_attention_gpu.cu:162:1: warning: unused parameter 'head_size_rounded' [-Wunused-parameter]
  161 |                                    const int num_heads, const int head_size, const int max_seqlen_k, const int max_seqlen_q,
      |                                                                                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  162 |                                    const int head_size_rounded,
      | ^

make[1]: *** [CMakeFiles/Makefile2:98: CMakeFiles/ctranslate2.dir/all] Error 2
make: *** [Makefile:136: all] Error 2

I ran cmake as follows, and it completed without problems:

# cmake -DWITH_MKL=OFF -DWITH_CUDA=ON -DWITH_CUDNN=ON -DOPENMP_RUNTIME=COMP -DBUILD_CLI=OFF -DCUDA_DYNAMIC_LOADING=ON ..
-- The C compiler identification is GNU 9.4.0
-- The CXX compiler identification is GNU 9.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- Build spdlog: 1.10.0
-- Build type: Release
-- Compiling for multiple CPU ISA and enabling runtime dispatch
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Using OpenMP: /usr/lib/gcc/aarch64-linux-gnu/9/libgomp.so;/usr/lib/aarch64-linux-gnu/libpthread.so
CMake Warning (dev) at CMakeLists.txt:433 (find_package):
  Policy CMP0146 is not set: The FindCUDA module is removed.  Run "cmake
  --help-policy CMP0146" for policy details.  Use the cmake_policy command to
  set the policy and suppress this warning.

This warning is for project developers.  Use -Wno-dev to suppress it.

-- Found CUDA: /usr/local/cuda (found suitable version "11.4", minimum required is "11.0")
-- Autodetected CUDA architecture(s):  8.7
-- NVCC host compiler: /usr/bin/c++
-- NVCC compilation flags: -std=c++17;-Xcompiler=-fopenmp;-gencode;arch=compute_87,code=sm_87;--expt-relaxed-constexpr;--expt-extended-lambda
-- Found cuDNN include directory: /usr/include
-- Found cuDNN libraries: /usr/lib/aarch64-linux-gnu/libcudnn.so
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_VISIBILITY - Success
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY
-- Performing Test COMPILER_HAS_HIDDEN_INLINE_VISIBILITY - Success
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR
-- Performing Test COMPILER_HAS_DEPRECATED_ATTR - Success
-- Configuring done (2.3s)
-- Generating done (0.1s)
-- Build files have been written to: /workspace/workbench/ctranslate2/build
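
A possible lead, not confirmed in this report: the errors above all point at `__grid_constant__`, and the configure log shows CUDA 11.4. To the best of my knowledge that annotation was only added in CUDA 11.7, so the toolkit may simply be too old for the flash-attention sources. The sketch below compares a toolkit version against that assumed minimum. The helper names (`version_ge`, `detect_cuda_version`, `check_cuda`) and the `11.7` threshold are assumptions made for illustration and should be checked against the NVIDIA release notes.

```shell
#!/bin/sh
# Assumed minimum toolkit version that provides __grid_constant__.
# This value is an assumption, not taken from the build log.
REQUIRED="11.7"

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort from GNU coreutils).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the release number from `nvcc --version`, for example "11.4".
# Prints nothing when nvcc is not on PATH.
detect_cuda_version() {
    nvcc --version 2>/dev/null |
        sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Reports whether the given toolkit version meets the assumed minimum.
check_cuda() {
    found="$1"
    if [ -z "$found" ]; then
        echo "CUDA version not detected"
    elif version_ge "$found" "$REQUIRED"; then
        echo "CUDA $found: meets assumed minimum $REQUIRED"
    else
        echo "CUDA $found: older than assumed minimum $REQUIRED"
    fi
}

# Version taken from the cmake output above ("found suitable version 11.4").
check_cuda "11.4"
```

On the machine itself, `check_cuda "$(detect_cuda_version)"` would read the version from the installed `nvcc` instead of the hard-coded value.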

OpenNMT / CTranslate2, issue #1771