Closed: horrorChen closed this issue 8 months ago
I also found another compile error when using DPC++ on an NVIDIA A100, and I can't fix it. I want to test the performance of portBLAS GEMM on Tensor Cores, so I modified `cmake/Modules/FindDPCPP.cmake` and added two lines to define the `NVIDIA_GPU` macro:
```cmake
if (${start_idx} AND ${sm_val} GREATER_EQUAL "80")
  add_definitions(-DSB_ENABLE_JOINT_MATRIX=1)
  add_definitions(-DNVIDIA_GPU=1)
  list(APPEND DPCPP_FLAGS "-Xclang;-cl-mad-enable")
  list(APPEND DPCPP_FLAGS "-DSYCL_EXT_ONEAPI_MATRIX_VERSION=4")
  list(APPEND DPCPP_FLAGS "-DSB_ENABLE_JOINT_MATRIX=1")
  list(APPEND DPCPP_FLAGS "-DNVIDIA_GPU=1")
endif()
```
I also updated the call to the `joint_matrix_mad` API in `src/operations/blas3/gemm_local_joint_matrix.hpp:830` to match the updated DPC++ interface:

```cpp
// reg_res[frag] = joint_matrix_mad(sg, inA, inB, reg_res[frag]);
joint_matrix_mad(sg, reg_res[frag], inA, inB, reg_res[frag]);
```
After that I compiled portBLAS and the samples with:

```shell
$ CC=clang CXX=clang++ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" -DDPCPP_SYCL_ARCH="sm_80" -DCMAKE_PREFIX_PATH=/opt/OpenBLAS -DCMAKE_THREAD_LIBS_INIT=-lpthread -DBLAS_ENABLE_TESTING=OFF -DBLAS_ENABLE_BENCHMARK=OFF -DCMAKE_BUILD_TYPE=Debug
$ ninja
```

and got the error:
```
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:562:13: error: use of undeclared identifier 'get_wi_data'
  562 |     get_wi_data(sg, float_out)[i] = alpha_ * data_left;
      |     ^
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:607:9: error: use of undeclared identifier 'get_wi_data'
  607 |     get_wi_data(sg, float_out)[i] =
      |     ^
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:576:40: error: use of undeclared identifier 'get_wi_data'
  576 |     static_cast<element_t>(get_wi_data(sg, reg_res[frag])[i]);
      |     ^
```
In DPC++'s `sycl/include/sycl/ext/oneapi/matrix/matrix-unified.hpp`, the call to `sycl::ext::oneapi::detail::get_wi_data` is only used for Intel GPUs; for NVIDIA and AMD GPUs, SYCL accesses the data held in a `joint_matrix` `jm` by reading `jm.matrix_impl.wi_marray` directly. That approach does not work for portBLAS, however. If I define the macro

```cpp
#define get_wi_data(sg, jm) jm.matrix_impl.wi_marray
```

and compile, I get another error:
```
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:562:13: error: no member named 'matrix_impl' in 'sycl::ext::oneapi::experimental::matrix::joint_matrix<sycl::sub_group, float, sycl::ext::oneapi::experimental::matrix::use::accumulator, 16, 16>'
  562 |     get_wi_data(sg, float_out)[i] = alpha_ * data_left;
      |     ^           ~~~~~~~~~
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:33:32: note: expanded from macro 'get_wi_data'
   33 | #define get_wi_data(sg, jm) jm.matrix_impl.wi_marray
      |                             ~~ ^
```
It seems that, because the code in `gemm_local_joint_matrix.hpp` is compiled as device code by DPC++, it cannot access the `matrix_impl` struct.
I would like to know how to fix this so that I can test GEMM performance on Tensor Cores. Is anyone working on this issue? 🤔
Hi @horrorChen,
Thank you for opening this issue. The complex header problem has been addressed in #484. The joint_matrix support in portBLAS is out of date and needs to be refactored to match the latest changes in DPC++. We are currently working on this and will put the changes up for review on portBLAS soon.
As for the changes you made to the `FindDPCPP.cmake` file, you don't need to add `NVIDIA_GPU` as a definition there; you can get the same behaviour by passing the `-DTUNING_TARGET=NVIDIA_GPU` CMake flag in your build command. Thanks.
Thanks for your reply @muhammad-tanvir-1211.
Actually, I found that `get_wi_data` is called to read the result of $AB$ and compute $\alpha AB + \beta C$ through the `joint_matrix` struct, but it is not necessary to do this via `joint_matrix`: the data can also be loaded with the `joint_matrix_load` API and the calculation done from there. That idea may be helpful.
Looking forward to your updates.
Hi @horrorChen The PR for the joint_matrix fix (https://github.com/codeplaysoftware/portBLAS/pull/491) is now up; please build the library with the new changes and let us know if you still hit any issues. Thanks.
Hi @muhammad-tanvir-1211 I have tested the PR and sample_gemm works, but something went wrong before compilation completed. I used the commands below:

```shell
$ CC=clang CXX=clang++ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" -DDPCPP_SYCL_ARCH="sm_80" -DTUNING_TARGET=NVIDIA_GPU -DCMAKE_PREFIX_PATH=/opt/OpenBLAS -DCMAKE_THREAD_LIBS_INIT=-lpthread -DBLAS_ENABLE_TESTING=OFF -DBLAS_ENABLE_BENCHMARK=OFF
$ ninja
```

The error output is:
```
[419/420] Linking CXX shared library libportblas.so.0.1.0
FAILED: libportblas.so.0.1.0
.....
ptxas fatal   : Unresolved extern function 'fabsf'
llvm-foreach:
ptxas fatal   : Unresolved extern function 'fabsf'
llvm-foreach:
clang++: error: ptxas command failed with exit code 255 (use -v to see invocation)
```
Did you ever encounter this problem?
Hi @horrorChen Yes, sorry, I forgot to mention this earlier. A few symbols are missing in the compiler for the NVIDIA backend. This PR (https://github.com/intel/llvm/pull/12218) fixes the above linker error.
Hi @muhammad-tanvir-1211 Thanks for the notification. portBLAS now works well in my environment after updating DPC++. Grateful for your work.
https://github.com/intel/llvm/pull/12218 has just been merged.
In the file `portBLAS/include/blas_meta.h`, the header file included there no longer exists in DPC++; instead you can use the replacement header located at `llvm-dpcpp/build/include/sycl/ext/oneapi/experimental/complex/complex.hpp`.