codeplaysoftware / portBLAS

An implementation of BLAS using the SYCL open standard.
Apache License 2.0
250 stars 48 forks source link

DPC++ compile bug #483

Closed horrorChen closed 8 months ago

horrorChen commented 9 months ago

In file portBLAS/include/blas_meta.h, the header file

#include <ext/oneapi/experimental/sycl_complex.hpp>

no longer exists in DPC++, instead you can use

#include <ext/oneapi/experimental/complex/complex.hpp>

which is located in llvm-dpcpp/build/include/sycl/ext/oneapi/experimental/complex/complex.hpp.

horrorChen commented 9 months ago

I also find another compile error when I use DPC++ on NVIDIA A100, but I can't fix it.

I want to test the performance of portBLAS gemm on tensor core, so I revise cmake/Modules/FindDPCPP.cmake and add 2 lines to set the macro NVIDIA_GPU

  if (${start_idx} AND ${sm_val} GREATER_EQUAL "80")
    add_definitions(-DSB_ENABLE_JOINT_MATRIX=1)
    add_definitions(-DNVIDIA_GPU=1)
    list(APPEND DPCPP_FLAGS "-Xclang;-cl-mad-enable")
    list(APPEND DPCPP_FLAGS "-DSYCL_EXT_ONEAPI_MATRIX_VERSION=4")
    list(APPEND DPCPP_FLAGS "-DSB_ENABLE_JOINT_MATRIX=1")
    list(APPEND DPCPP_FLAGS "-DNVIDIA_GPU=1")
  endif()

Meanwhile, I revise the call of API joint_matrix_mad in src/operations/blas3/gemm_local_joint_matrix.hpp:830 according to the update of DPC++

        // reg_res[frag] = joint_matrix_mad(sg, inA, inB, reg_res[frag]);
        joint_matrix_mad(sg, reg_res[frag], inA, inB, reg_res[frag]);

After that I compile portBLAS and sample with the command

$ CC=clang CXX=clang++ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" -DDPCPP_SYCL_ARCH="sm_80" -DCMAKE_PREFIX_PATH=/opt/OpenBLAS -DCMAKE_THREAD_LIBS_INIT=-lpthread -DBLAS_ENABLE_TESTING=OFF -DBLAS_ENABLE_BENCHMARK=OFF -DCMAKE_BUILD_TYPE=Debug
$ ninja

and get the error

portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:562:13: error: use of undeclared identifier 'get_wi_data'
  562 |             get_wi_data(sg, float_out)[i] = alpha_ * data_left;
      |             ^
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:607:9: error: use of undeclared identifier 'get_wi_data'
  607 |         get_wi_data(sg, float_out)[i] =
      |         ^
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:576:40: error: use of undeclared identifier 'get_wi_data'
  576 |                 static_cast<element_t>(get_wi_data(sg, reg_res[frag])[i]);
      |                                        ^

In DPC++ sycl/include/sycl/ext/oneapi/matrix/matrix-unified.hpp, the call of sycl::ext::oneapi::detail::get_wi_data is used for Intel GPU, while for NVIDIA GPU and AMD GPU, sycl visits the object in joint_matrix jm by directly call jm.matrix_impl.wi_marray, but it doesn't work for portBLAS. If I define the macro

#define get_wi_data(sg, jm) jm.matrix_impl.wi_marray

and compile, I will get another error

portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:562:13: error: no member named 'matrix_impl' in 'sycl::ext::oneapi::experimental::matrix::joint_matrix<sycl::sub_group, float, sycl::ext::oneapi::experimental::matrix::use::accumulator, 16, 16>'
  562 |             get_wi_data(sg, float_out)[i] = alpha_ * data_left;
      |             ^               ~~~~~~~~~
portBLAS/samples/../src/operations/blas3/gemm_local_joint_matrix.hpp:33:32: note: expanded from macro 'get_wi_data'
   33 | #define get_wi_data(sg, jm) jm.matrix_impl.wi_marray
      |                             ~~ ^

It seems like that since the code in gemm_local_joint_matrix.hpp is packed as device code in DPC++, it is unable to access the struct matrix_impl.

I want to learn how to fix it and test the performance of gemm on tensor core.

horrorChen commented 9 months ago

Is anyone working on this issue?🤔

muhammad-tanvir-1211 commented 9 months ago

Hi @horrorChen, Thank you for opening this issue. The complex header problem has been addressed in #484. The joint_matrix support in portBLAS is out of date and needs to be refactored to match the latest changes in DPC++. We are currently working on addressing this and will put up the changes for review on portBLAS soon. As for the changes you made to the FindDPCPP.cmake file, you don't need to add NVIDIA_GPU as a definition in this file, but instead get the same behaviour by passing the -DTUNING_TARGET=NVIDIA_GPU cmake flag in your build command. Thanks.

horrorChen commented 9 months ago

Thanks for your reply @muhammad-tanvir-1211.

Actually, I find that the call of get_wi_data is used to get the result of $AB$ and calculate the answer of $\alpha AB + \beta C$ with the joint_matrix struct, but it is unnecessary to address it with joint_matrix. The data can also be loaded with joint_matrix_load API and do the calculation. The idea may be helpful.

Hope for your update of work.

muhammad-tanvir-1211 commented 8 months ago

Hi @horrorChen The PR for joint_matrix fix (https://github.com/codeplaysoftware/portBLAS/pull/491) is now up, please use the new changes to build the library and let us know if it is still causing any issues. Thanks.

horrorChen commented 8 months ago

hi @muhammad-tanvir-1211 I have tested the PR and sample_gemm works. But something went wrong before completiong of compile.

I use the command below.

$ CC=clang CXX=clang++ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_TARGET="nvptx64-nvidia-cuda" -DDPCPP_SYCL_ARCH="sm_80" -DTUNING_TARGET=NVIDIA_GPU -DCMAKE_PREFIX_PATH=/opt/OpenBLAS -DCMAKE_THREAD_LIBS_INIT=-lpthread -DBLAS_ENABLE_TESTING=OFF -DBLAS_ENABLE_BENCHMARK=OFF
$ ninja

The error info is like

[419/420] Linking CXX shared library libportblas.so.0.1.0
FAILED: libportblas.so.0.1.0 
.....
ptxas fatal   : Unresolved extern function 'fabsf'
llvm-foreach: 
ptxas fatal   : Unresolved extern function 'fabsf'
llvm-foreach: 
clang++: error: ptxas command failed with exit code 255 (use -v to see invocation)

Did you ever encounter this problem?

muhammad-tanvir-1211 commented 8 months ago

Hi @horrorChen Yes, Sorry I forgot to mention this earlier. There are a few symbols missing in the compiler for NVIDIA backend. This PR (https://github.com/intel/llvm/pull/12218) fixes the above linker error.

horrorChen commented 8 months ago

Hi @muhammad-tanvir-1211 Thanks for your notification. PortBLAS now works well in my environment after updating DPC++. Grateful for your work.

hdelan commented 8 months ago

https://github.com/intel/llvm/pull/12218 has just been merged.