NVIDIA / cudnn-frontend

cudnn_frontend provides a C++ wrapper for the cuDNN backend API and samples on how to use it
MIT License

Matmul test failure #78

Open shiwenloong opened 5 months ago

shiwenloong commented 5 months ago

I encountered a test failure after building and running the tests. Here are the details:

I followed the build instructions as provided in the README:

mkdir build
cd build
cmake ..
make -j8

Output is:

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUDAToolkit: /usr/local/cuda-12.2/targets/x86_64-linux/include (found version "12.2.140")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Performing Test HAVE_FLAG__ffile_prefix_map__nvme2_medsam_cuda_mode_cudnn_frontend_build__deps_catch2_src__
-- Performing Test HAVE_FLAG__ffile_prefix_map__nvme2_medsam_cuda_mode_cudnn_frontend_build__deps_catch2_src__ - Success
-- cudnn found at /usr/local/cuda-12.2/lib64/libcudnn.so.
-- Found LIBRARY: /usr/local/cuda-12.2/include
-- cuDNN: /usr/local/cuda-12.2/lib64/libcudnn.so
-- cuDNN: /usr/local/cuda-12.2/include
-- cudnn_adv_infer found at /usr/local/cuda-12.2/lib64/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/local/cuda-12.2/lib64/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/local/cuda-12.2/lib64/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/local/cuda-12.2/lib64/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/local/cuda-12.2/lib64/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/local/cuda-12.2/lib64/libcudnn_ops_train.so.
-- cudnn found at /usr/local/cuda-12.2/lib64/libcudnn.so.
-- cuDNN: /usr/local/cuda-12.2/lib64/libcudnn.so
-- cuDNN: /usr/local/cuda-12.2/include
-- cudnn_adv_infer found at /usr/local/cuda-12.2/lib64/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/local/cuda-12.2/lib64/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/local/cuda-12.2/lib64/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/local/cuda-12.2/lib64/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/local/cuda-12.2/lib64/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/local/cuda-12.2/lib64/libcudnn_ops_train.so.
-- Configuring done (6.0s)
-- Generating done (0.0s)
-- Build files have been written to: /nvme2/medsam/cuda-mode/cudnn-frontend/build
[100%] Linking CXX executable ../bin/samples
Warning: Unused direct dependencies:
        /usr/local/cuda-12.2/lib64/libnvrtc.so.12
        /usr/local/cuda-12.2/lib64/libnvrtc-builtins.so.12.2
        /lib/x86_64-linux-gnu/libcuda.so.1
        /usr/local/cuda-12.2/lib64/libnvJitLink.so.12
        /usr/local/cuda-12.2/lib64/libcudnn_adv_train.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_ops_train.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_cnn_train.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_adv_infer.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_cnn_infer.so.8
        /usr/local/cuda-12.2/lib64/libcudnn_ops_infer.so.8
[100%] Built target samples

Then I ran the MatMul test:

CUDNN_FRONTEND_LOG_FILE=stdout CUDNN_FRONTEND_LOG_INFO=1 ./build/bin/samples MatMul

Output is:

Filters: "MatMul"
Randomness seeded to: 1045110732
[cudnn_frontend] INFO: Validating matmul node GEMM...
[cudnn_frontend] INFO: Inferrencing properties for matmul node GEMM...
[cudnn_frontend] INFO: Creating cudnn tensors for node named 'GEMM':
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["BFLOAT16"] Id: 2 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,32,128 ] Str [ 4096,128,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["BFLOAT16"] Id: 3 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,128,64 ] Str [ 8192,64,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["FLOAT"] Id: 4 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,32,64 ] Str [ 2048,64,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: Building MatmulNode operations GEMM...
[cudnn_frontend] CUDNN_BACKEND_MATMUL_DESCRIPTOR : Math precision ["FLOAT"]
[cudnn_frontend] CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR has 1operations.
Tag: Matmul_

[cudnn_frontend] INFO:  Getting plan from heuristics for Matmul_ ...
[cudnn_frontend] CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR :
Heuristic Mode 3 has 6 configurations 
[cudnn_frontend] INFO: get_heuristics_list statuses: CUDNN_STATUS_SUCCESS 
[cudnn_frontend] INFO: config list has 6 configurations.
[cudnn_frontend] INFO: config list has 6 good configurations.
[cudnn_frontend] INFO: Extracting engine configs.
[cudnn_frontend] INFO: Querying engine config properties
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 0 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 1 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 2 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 3 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 4 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudnn_frontend] INFO: Building plan at index 5 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: plans.check_support(h) at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/graph_interface.h:260

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
samples is a Catch2 v3.3.2 host application.
Run with -? for options

-------------------------------------------------------------------------------
Matmul
-------------------------------------------------------------------------------
/nvme2/medsam/cuda-mode/cudnn-frontend/samples/cpp/matmuls.cpp:31
...............................................................................

/nvme2/medsam/cuda-mode/cudnn-frontend/samples/cpp/matmuls.cpp:80: FAILED:
  REQUIRE( graph.check_support(handle).is_good() )
with expansion:
  false

===============================================================================
test cases:  1 |  0 passed | 1 failed
assertions: 11 | 10 passed | 1 failed
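
The three tensor descriptors in the log describe a batched matmul: A is 16 x 32 x 128, B is 16 x 128 x 64, and the output is 16 x 32 x 64. As a quick sanity check, the minimal sketch below recomputes the fully packed strides and confirms the shapes agree. It is plain Python, the helper names are hypothetical, and nothing in it is part of cudnn_frontend. If it reports no problems, a simple dimension or stride mismatch is unlikely to be the cause of the failure.

```python
def packed_strides(dims):
    """Row-major (fully packed) strides for the given dimensions."""
    strides = []
    running = 1
    for d in reversed(dims):
        strides.append(running)
        running *= d
    return list(reversed(strides))


def check_batched_matmul(a, b, c):
    """Return a list of problems; an empty list means A @ B -> C is consistent."""
    problems = []
    (ab, m, k), (bb, k2, n), (cb, cm, cn) = a["dim"], b["dim"], c["dim"]
    if not (ab == bb == cb):
        problems.append(f"batch mismatch: {ab}, {bb}, {cb}")
    if k != k2:
        problems.append(f"inner dimension mismatch: {k} vs {k2}")
    if (cm, cn) != (m, n):
        problems.append(f"output is {cm}x{cn}, expected {m}x{n}")
    for name, t in (("A", a), ("B", b), ("C", c)):
        if t["stride"] != packed_strides(t["dim"]):
            problems.append(f"{name} is not fully packed")
    return problems


# Values copied from the CUDNN_BACKEND_TENSOR_DESCRIPTOR lines in the log.
A = {"dim": [16, 32, 128], "stride": [4096, 128, 1]}
B = {"dim": [16, 128, 64], "stride": [8192, 64, 1]}
C = {"dim": [16, 32, 64], "stride": [2048, 64, 1]}

problems = check_batched_matmul(A, B, C)
print(problems or "shapes and strides look consistent")
```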

Anerudhan commented 5 months ago

Hi @shiwenloong ,

Thanks for reporting this issue.

I have added an experimental branch, issues/75_and_78, that prints the result of cudaGetLastError().

Please run the following command and attach both backend_api.log and the frontend log so that we can help debug:

CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=backend_api.log CUDNN_FRONTEND_LOG_FILE=stdout CUDNN_FRONTEND_LOG_INFO=1 ./build/bin/samples "MatMul"

Thanks
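
For anyone scripting this run, here is a minimal Python sketch that builds the logging environment and reports whether the backend log file was actually written. The function names are hypothetical. One assumption is labeled in the code: as far as I know, CUDNN_LOGLEVEL_DBG is read by cuDNN 9, while cuDNN 8 uses the older CUDNN_LOGINFO_DBG, CUDNN_LOGWARN_DBG, and CUDNN_LOGERR_DBG switches. The link step in the build output lists libcudnn_*.so.8, which suggests cuDNN 8 on the reporter's machine, so please verify the variable names against the documentation for the installed version.

```python
import os
import subprocess
from pathlib import Path


def debug_env(cudnn_major, backend_log="backend_api.log"):
    """Environment variables for frontend and backend logging.

    ASSUMPTION: cuDNN 9 reads CUDNN_LOGLEVEL_DBG, while cuDNN 8 reads the
    older CUDNN_LOGINFO_DBG / CUDNN_LOGWARN_DBG / CUDNN_LOGERR_DBG switches.
    """
    env = {
        "CUDNN_FRONTEND_LOG_FILE": "stdout",
        "CUDNN_FRONTEND_LOG_INFO": "1",
        "CUDNN_LOGDEST_DBG": backend_log,
    }
    if cudnn_major >= 9:
        env["CUDNN_LOGLEVEL_DBG"] = "3"
    else:
        env.update(
            {
                "CUDNN_LOGINFO_DBG": "1",
                "CUDNN_LOGWARN_DBG": "1",
                "CUDNN_LOGERR_DBG": "1",
            }
        )
    return env


def run_sample(binary, test_filter, cudnn_major):
    """Run the samples binary and report whether a backend log was produced."""
    extra = debug_env(cudnn_major)
    result = subprocess.run(
        [binary, test_filter],
        env={**os.environ, **extra},
        capture_output=True,
        text=True,
    )
    backend_log_exists = Path(extra["CUDNN_LOGDEST_DBG"]).exists()
    return result.stdout, backend_log_exists
```

Calling run_sample("./build/bin/samples", "MatMul", cudnn_major=8) from the repository root would return the frontend log text and a flag saying whether backend_api.log exists in the current directory.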

shiwenloong commented 5 months ago

I ran:

CUDNN_LOGLEVEL_DBG=3 CUDNN_LOGDEST_DBG=backend_api.log CUDNN_FRONTEND_LOG_FILE=stdout CUDNN_FRONTEND_LOG_INFO=1 ./build/bin/samples "MatMul"

But I can't find the backend_api.log file. This is the frontend log:

Filters: "MatMul"
Randomness seeded to: 3739293787
[cudnn_frontend] INFO: Validating matmul node GEMM...
[cudnn_frontend] INFO: Inferrencing properties for matmul node GEMM...
[cudnn_frontend] INFO: Creating cudnn tensors for node named 'GEMM':
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["BFLOAT16"] Id: 2 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,32,128 ] Str [ 4096,128,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["BFLOAT16"] Id: 3 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,128,64 ] Str [ 8192,64,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: CUDNN_BACKEND_TENSOR_DESCRIPTOR : Datatype: ["FLOAT"] Id: 4 nDims 3 VectorCount: 1 vectorDimension -1 Dim [ 16,32,64 ] Str [ 2048,64,1 ] isVirtual: 0 isByValue: 0 Alignment: 16 reorder_type: ["NONE"]
[cudnn_frontend] INFO: Building MatmulNode operations GEMM...
[cudnn_frontend] CUDNN_BACKEND_MATMUL_DESCRIPTOR : Math precision ["FLOAT"]
[cudnn_frontend] CUDNN_BACKEND_OPERATIONGRAPH_DESCRIPTOR has 1operations.
Tag: Matmul_

[cudnn_frontend] INFO:  Getting plan from heuristics for Matmul_ ...
[cudnn_frontend] CUDNN_BACKEND_ENGINEHEUR_DESCRIPTOR :
Heuristic Mode 3 has 6 configurations 
[cudnn_frontend] INFO: get_heuristics_list statuses: CUDNN_STATUS_SUCCESS 
[cudnn_frontend] INFO: config list has 6 configurations.
[cudnn_frontend] INFO: config list has 6 good configurations.
[cudnn_frontend] INFO: Extracting engine configs.
[cudnn_frontend] INFO: Querying engine config properties
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudaGetLastError] ERROR: 0
[cudnn_frontend] INFO: Building plan at index 0 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudaGetLastError] ERROR: 0
[cudnn_frontend] INFO: Building plan at index 1 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudaGetLastError] ERROR: 0
[cudnn_frontend] INFO: Building plan at index 2 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudaGetLastError] ERROR: 0
[cudnn_frontend] INFO: Building plan at index 3 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudaGetLastError] ERROR: 0
[cudnn_frontend] INFO: Building plan at index 4 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED. ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] because plan building failed at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/plans.h:179
[cudaGetLastError] ERROR: 0
[cudnn_frontend] INFO: Building plan at index 5 gave ["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_EXECUTION_FAILED
[cudnn_frontend] ERROR: plans.check_support(h) at /nvme2/medsam/cuda-mode/cudnn-frontend/include/cudnn_frontend/graph_interface.h:260

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
samples is a Catch2 v3.3.2 host application.
Run with -? for options

-------------------------------------------------------------------------------
Matmul
-------------------------------------------------------------------------------
/nvme2/medsam/cuda-mode/cudnn-frontend/samples/cpp/matmuls.cpp:31
...............................................................................

/nvme2/medsam/cuda-mode/cudnn-frontend/samples/cpp/matmuls.cpp:80: FAILED:
  REQUIRE( graph.check_support(handle).is_good() )
with expansion:
  false

===============================================================================
test cases:  1 |  0 passed | 1 failed
assertions: 11 | 10 passed | 1 failed
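
The log above is long and repetitive, so a small parser can make the pattern easier to see. The sketch below is plain Python with hypothetical names, and the regular expressions are written only against the line formats visible in this thread. It collects the status reported for each plan index and counts the values printed by the [cudaGetLastError] lines. A value of 0 corresponds to cudaSuccess, so if every plan fails while the CUDA runtime reports no pending error, the failure probably originates inside cuDNN's plan finalization rather than in an earlier CUDA call. That reading is an interpretation, not a confirmed diagnosis.

```python
import re
from collections import Counter

# Patterns match the line formats seen in the frontend log in this thread.
PLAN_RE = re.compile(
    r'Building plan at index (\d+) gave \["(\w+)"\].*cudnn_status: (\w+)'
)
CUDA_ERR_RE = re.compile(r"\[cudaGetLastError\] ERROR: (\d+)")


def summarize(log_text):
    """Return ({plan index: (frontend error, cudnn status)}, Counter of CUDA error codes)."""
    plans = {}
    cuda_errors = Counter()
    for line in log_text.splitlines():
        plan = PLAN_RE.search(line)
        if plan:
            plans[int(plan.group(1))] = (plan.group(2), plan.group(3))
            continue
        err = CUDA_ERR_RE.search(line)
        if err:
            cuda_errors[int(err.group(1))] += 1
    return plans, cuda_errors


# Two lines excerpted from the log above.
sample = (
    "[cudaGetLastError] ERROR: 0\n"
    '[cudnn_frontend] INFO: Building plan at index 0 gave '
    '["GRAPH_EXECUTION_PLAN_CREATION_FAILED"] with message: '
    "CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed "
    "cudnn_status: CUDNN_STATUS_EXECUTION_FAILED\n"
)

plans, cuda_errors = summarize(sample)
for index, (frontend_error, status) in sorted(plans.items()):
    print(index, frontend_error, status)
print(dict(cuda_errors))
```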