LeelaChessZero / lc0

The rewritten engine, originally for tensorflow. Now all other backends have been ported here.
GNU General Public License v3.0
2.38k stars 525 forks

sycl backend support for AMD, NVIDIA, and Intel #1925

Closed KateBlueSky closed 10 months ago

KateBlueSky commented 11 months ago

Hi, I'm not sure if this SYCL backend I developed is something the Lc0 team would be interested in. The SYCL backend is based on the CUDA backend and can run on Nvidia, AMD, or Intel GPUs. I've tested it using the benchmark functionality and the 752187.pb.gz network, but a lot more testing still needs to be done before merging this code. I was wondering if there is an automated test I could run, or how you would suggest I test the code? Also, the meson build for the SYCL code still needs to be merged into the meson build on main.

To compile the SYCL code for AMD or Nvidia you would need to do the following: 1.) Download and set up the DPC++ LLVM compiler. 2.) Run the buildSycl batch script included with this code. For AMD use the following command: CC=clang CXX=clang++ ./buildSycl.sh -DUSE_AMD_BACKEND=true -DUSE_SM=gfx90a (set USE_SM to whatever architecture you're using, e.g. for MI100 use gfx908, for MI250 use gfx90a). For Nvidia use: CC=clang CXX=clang++ ./buildSycl.sh -DUSE_NVIDIA_BACKEND=true -DUSE_SM=61 (set USE_SM to whatever architecture you're using, e.g. for A100 use 80, for H100 use 90).

To compile the code for Intel GPUs you would need to do the following: 1.) Download oneAPI from the Intel website. 2.) Run the buildSycl batch script included with this code, using the following command: CC=icpx CXX=icpx ./buildSycl.sh -DUSE_L0_BACKEND=true

To run the benchmark with the SYCL backend, use the following command: ./lc0_sycl benchmark -b sycl

Thanks

borg323 commented 11 months ago

Thank you very much for your contribution, we are certainly interested. This is a large change and will take some time to digest, so only a few quick comments for now:

  1. For tests you can use the check backend. For this you need to have a second backend compiled in as a reference, we usually use the blas one. Then you can do something like: ./lc0 benchmark --backend=check --backend-opts=mode=display,freq=1.0,sycl,blas which will run the benchmark and print the relative and absolute error between the backends for each batch.
  2. Are the files in the dpct directory unmodified? In that case they can probably move to the third_party directory.
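The comparison the check backend performs boils down to computing the maximum absolute and relative error between the two backends' outputs for each batch; a minimal sketch of those metrics (hypothetical helper, not the actual lc0 implementation):

```python
import numpy as np

def batch_errors(working, reference, eps=1e-9):
    # Maximum absolute and relative error between two backends' outputs,
    # in the spirit of what the check backend prints per batch.
    w = np.asarray(working, dtype=np.float64)
    r = np.asarray(reference, dtype=np.float64)
    abs_err = np.abs(w - r)
    rel_err = abs_err / np.maximum(np.abs(r), eps)
    return float(abs_err.max()), float(rel_err.max())
```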
KateBlueSky commented 11 months ago
  1. Are the files in the dpct directory unmodified? In that case they can probably move to the third_party directory.

For this I can just remove the dpct header files. They are part of an open-source tool called SYCLomatic that converts CUDA code to SYCL code and aren't really needed. Also, I'll do some testing with the reference backend you mentioned. Thanks

borg323 commented 10 months ago

I tried to build the l0 sycl backend on windows, but when running it errors out with (sycl):

Unhandled exception: No kernel named _ZTSZN6lczero15sycldnn_backend17copyTypeConvertedIffEEvPT_PT0_iRN4sycl3_V15queueEEUlNS7_7nd_itemILi3EEEE_ was found -46 (PI_ERROR_INVALID_KERNEL_NAME)

or (sycl-fp16):

Unhandled exception: No kernel named _ZTSZN6lczero15sycldnn_backend17copyTypeConvertedIN4sycl3_V16detail9half_impl4halfEfEEvPT_PT0_iRNS3_5queueEEUlNS3_7nd_itemILi3EEEE_ was found -46 (PI_ERROR_INVALID_KERNEL_NAME)

Am I doing something wrong, maybe missing some compiler/linker flags?

Here is the minimal patch I used to build, including build script:

diff --git a/build-sycl.cmd b/build-sycl.cmd
new file mode 100644
index 0000000..270c184
--- /dev/null
+++ b/build-sycl.cmd
@@ -0,0 +1,61 @@
+@echo off
+setlocal
+
+rem 1. Set the following for the options you want to build.
+set CUDNN=true
+set CUDA=true
+set DX12=false
+set OPENCL=false
+set MKL=false
+set DNNL=true
+set OPENBLAS=false
+set EIGEN=false
+set TEST=false
+
+rem 2. Edit the paths for the build dependencies.
+set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0
+set CUDNN_PATH=%CUDA_PATH%
+set OPENBLAS_PATH=C:\OpenBLAS
+set MKL_PATH=C:\Program Files (x86)\Intel\oneAPI\mkl\latest\
+set DNNL_PATH=C:\Program Files (x86)\Intel\oneAPI\dnnl\latest\cpu_iomp
+set OPENCL_LIB_PATH=%CUDA_PATH%\lib\x64
+set OPENCL_INCLUDE_PATH=%CUDA_PATH%\include
+
+rem 3. In most cases you won't need to change anything further down.
+echo Deleting build directory:
+rd /s build
+
+set CC=icx
+set CXX=icx
+set WINDRES=rc
+
+set backend=ninja
+
+set BLAS=true
+if %MKL%==false if %DNNL%==false if %OPENBLAS%==false if %EIGEN%==false set BLAS=false
+
+if "%CUDA_PATH%"=="%CUDNN_PATH%" (
+  set CUDNN_LIB_PATH=%CUDNN_PATH%\lib\x64
+  set CUDNN_INCLUDE_PATH=%CUDNN_PATH%\include
+) else (
+  set CUDNN_LIB_PATH=%CUDA_PATH%\lib\x64,%CUDNN_PATH%\lib\x64
+  set CUDNN_INCLUDE_PATH=%CUDA_PATH%\include,%CUDNN_PATH%\include
+)
+
+if %CUDNN%==true set PATH=%CUDA_PATH%\bin;%PATH%
+
+meson build --backend %backend% --buildtype release -Ddx=%DX12% -Dcudnn=%CUDNN% -Dplain_cuda=%CUDA% ^
+-Dopencl=%OPENCL% -Dblas=%BLAS% -Dmkl=%MKL% -Dopenblas=%OPENBLAS% -Ddnnl=%DNNL% -Dgtest=%TEST% ^
+-Dcudnn_include="%CUDNN_INCLUDE_PATH%" -Dcudnn_libdirs="%CUDNN_LIB_PATH%" ^
+-Dmkl_include="%MKL_PATH%\include" -Dmkl_libdirs="%MKL_PATH%\lib\intel64" -Ddnnl_dir="%DNNL_PATH%" ^
+-Dopencl_libdirs="%OPENCL_LIB_PATH%" -Dopencl_include="%OPENCL_INCLUDE_PATH%" ^
+-Dopenblas_include="%OPENBLAS_PATH%\include" -Dopenblas_libdirs="%OPENBLAS_PATH%\lib" ^
+-Ddefault_library=static -DUSE_SYCL=true -DUSE_L0_BACKEND=true -Db_vscrt=md -Db_lto=false
+
+if errorlevel 1 exit /b
+
+pause
+
+cd build
+
+ninja
diff --git a/meson.build b/meson.build
index bd05466..ce7d73d 100644
--- a/meson.build
+++ b/meson.build
@@ -36,8 +36,8 @@

 project('lc0', 'cpp',
         #default_options : ['cpp_std=c++17', 'b_ndebug=if-release', 'warning_level=3', 'b_lto=true', 'b_vscrt=mt', ],
-        default_options : ['cpp_std=c++20', 'b_ndebug=false'],
-        meson_version: '>=0.52')
+        default_options : ['cpp_std=c++17', 'b_ndebug=false'],
+        meson_version: '>=0.64')

 cc = meson.get_compiler('cpp')
@@ -242,9 +242,9 @@ files += [
 ]

 deps += dependency('zlib', fallback: ['zlib', 'zlib_dep'])
-deps += cc.find_library('pthread', required: true) 
+#deps += cc.find_library('pthread', required: true) 

-files += 'src/utils/filesystem.posix.cc'
+files += 'src/utils/filesystem.win32.cc'
 includes += include_directories('src')

@@ -281,14 +281,14 @@ if get_option('USE_SYCL')
       if(get_option('USE_L0_BACKEND') == true)
         message('Building SYCL for the L0 backend')
         add_project_arguments('-DMKL_ILP64', language : 'cpp')
-        deps += cc.find_library('sycl', required: true)
+        deps += cc.find_library('sycl6', required: true)
         deps += cc.find_library('mkl_sycl', required: true)
         deps += cc.find_library('mkl_intel_ilp64', required: true)
         deps += cc.find_library('mkl_sequential', required: true)
         deps += cc.find_library('mkl_core', required: true)   
         deps += cc.find_library('OpenCL', required: true)  
-        deps += cc.find_library('dl', required: true)
-        deps += cc.find_library('m', required: true)
+#        deps += cc.find_library('dl', required: true)
+#        deps += cc.find_library('m', required: true)
         add_project_arguments('-DDEFAULT_MINIBATCH_SIZE=248', language : 'cpp')
         add_project_arguments(get_option('GPU_AOT'), language : 'cpp')
         mlink_args += get_option('GPU_AOT')
diff --git a/src/neural/factory.h b/src/neural/factory.h
index 4448aa3..08e4634 100644
--- a/src/neural/factory.h
+++ b/src/neural/factory.h
@@ -115,14 +115,14 @@ class NetworkFactory {
   friend class Register;
 };

-#define REGISTER_NETWORK_WITH_COUNTER2(name, func, priority, counter) \
-  namespace {                                                         \
-  static NetworkFactory::Register regH38fhs##counter(                 \
-      name,                                                           \
-      [](const std::optional<WeightsFile>& w, const OptionsDict& o) { \
-        return func(w, o);                                            \
-      },                                                              \
-      priority);                                                      \
+#define REGISTER_NETWORK_WITH_COUNTER2(name, func, priority, counter)          \
+  namespace {                                                                  \
+  std::unique_ptr<Network> func_##counter(const std::optional<WeightsFile>& w, \
+                                          const OptionsDict& o) {              \
+    return func(w, o);                                                         \
+  }                                                                            \
+  static NetworkFactory::Register regH38fhs##counter(name, func_##counter,     \
+                                                     priority);                \
   }
 #define REGISTER_NETWORK_WITH_COUNTER(name, func, priority, counter) \
   REGISTER_NETWORK_WITH_COUNTER2(name, func, priority, counter)
diff --git a/src/neural/sycl/layers.cc.dp.cpp b/src/neural/sycl/layers.cc.dp.cpp
index 11ad697..1d7d294 100644
--- a/src/neural/sycl/layers.cc.dp.cpp
+++ b/src/neural/sycl/layers.cc.dp.cpp
@@ -1536,6 +1536,7 @@ template <>
     sycl::half beta = 0;
     #endif

+#ifdef USE_CUBLAS
     sycl_queue.submit([&](sycl::handler &cgh) {

          cgh.host_task([=](sycl::interop_handle ih) {
@@ -1553,6 +1554,7 @@ template <>

          });   
    });
+#endif
 }

 template <>
@@ -2367,6 +2369,9 @@ static void cublasXGemmBatched(transpose_type transa,
     });
   }
   #else
+    std::int64_t strideA = transa == transpose_type_transpose ? lda * m : lda * k;
+    std::int64_t strideB = transb == transpose_type_transpose ? ldb * k : ldb * n;
+    std::int64_t strideC = ldc * n;
     oneapi::mkl::blas::column_major::gemm_batch(sycl_queue, transa, transb, m, n, k,  alpha, (const float *)A, lda, strideA, (const float *)B, ldb, strideB, beta, (float *)C, ldc, strideC, batchCount); 
   #endif
 }
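The strides added in the last hunk follow from column-major storage: a packed matrix with leading dimension ld and c columns occupies ld * c elements, so for A that is lda * m when transposed and lda * k otherwise. A small numpy emulation of the no-transpose case under those assumptions (a sketch, not the actual oneMKL gemm_batch call):

```python
import numpy as np

def gemm_batch_colmajor(A, B, m, n, k, lda, ldb, ldc, batch_count):
    # Strided batched C = A @ B over flat column-major buffers, using the
    # same stride formulas as the patch (no-transpose case).
    stride_a, stride_b, stride_c = lda * k, ldb * n, ldc * n
    C = np.zeros(batch_count * stride_c, dtype=np.float32)
    for i in range(batch_count):
        # column-major view: element (r, c) lives at flat index c * ld + r
        a = A[i * stride_a:(i + 1) * stride_a].reshape(k, lda).T[:m, :]
        b = B[i * stride_b:(i + 1) * stride_b].reshape(n, ldb).T[:k, :]
        c = C[i * stride_c:(i + 1) * stride_c].reshape(n, ldc).T
        c[:m, :] = a @ b
    return C
```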
KateBlueSky commented 10 months ago

Hi @borg323 what version of Windows are you using? I think there's an issue with DPC++ LLVM on Windows 10 and the linker DPC++ uses; it's related to this issue: https://github.com/intel/llvm/issues/11568 Also, I need to test this on a Windows machine myself. Unfortunately, I've only tested the build on Ubuntu.

borg323 commented 10 months ago

I'm also using Windows 10, but with the latest release of oneAPI Base. The compiler is icx and the linker used is xilink; I'll try clang instead to see if it helps.

borg323 commented 10 months ago

I did get it to work eventually, but it was a bit messy: I did some experiments using the vector-add example, and only managed to make it work with icx handling the final link stage instead of the linker. Unfortunately I haven't found a way to get meson to use icx as the linker yet, but doing it by hand results in a working executable.

KateBlueSky commented 10 months ago

@borg323 For using meson on Ubuntu with the DPC++ compiler I had to use the newest meson update from https://github.com/mesonbuild/meson . The oneAPI functionality for meson on Windows and Ubuntu was added in October of last year: https://github.com/mesonbuild/meson/pull/10909 . I don't know if this helps you or not. On my end I still need to get rid of the dpct header files and do more testing with the test you describe above.

borg323 commented 10 months ago

Here is an updated build script that makes a working windows binary with a few fragile hacks... I'm still looking for a meson solution.

@echo off
setlocal

rem 1. Set the following for the options you want to build.
set CUDNN=true
set CUDA=true
set DX12=false
set OPENCL=false
set MKL=false
set DNNL=true
set OPENBLAS=false
set EIGEN=false
set TEST=false

rem 2. Edit the paths for the build dependencies.
set CUDA_PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0
set CUDNN_PATH=%CUDA_PATH%
set OPENBLAS_PATH=C:\OpenBLAS
set MKL_PATH=C:\Program Files (x86)\Intel\oneAPI\mkl\latest\
set DNNL_PATH=C:\Program Files (x86)\Intel\oneAPI\dnnl\latest\cpu_iomp
set OPENCL_LIB_PATH=%CUDA_PATH%\lib\x64
set OPENCL_INCLUDE_PATH=%CUDA_PATH%\include

rem 3. In most cases you won't need to change anything further down.
echo Deleting build directory:
rd /s build

set CC=icx
set CXX=icx

set WINDRES=rc

set backend=ninja

set BLAS=true
if %MKL%==false if %DNNL%==false if %OPENBLAS%==false if %EIGEN%==false set BLAS=false

if "%CUDA_PATH%"=="%CUDNN_PATH%" (
  set CUDNN_LIB_PATH=%CUDNN_PATH%\lib\x64
  set CUDNN_INCLUDE_PATH=%CUDNN_PATH%\include
) else (
  set CUDNN_LIB_PATH=%CUDA_PATH%\lib\x64,%CUDNN_PATH%\lib\x64
  set CUDNN_INCLUDE_PATH=%CUDA_PATH%\include,%CUDNN_PATH%\include
)

if %CUDNN%==true set PATH=%CUDA_PATH%\bin;%PATH%

meson build --backend %backend% --buildtype release -Ddx=%DX12% -Dcudnn=%CUDNN% -Dplain_cuda=%CUDA% ^
-Dopencl=%OPENCL% -Dblas=%BLAS% -Dmkl=%MKL% -Dopenblas=%OPENBLAS% -Ddnnl=%DNNL% -Dgtest=%TEST% ^
-Dcudnn_include="%CUDNN_INCLUDE_PATH%" -Dcudnn_libdirs="%CUDNN_LIB_PATH%" ^
-Dmkl_include="%MKL_PATH%\include" -Dmkl_libdirs="%MKL_PATH%\lib\intel64" -Ddnnl_dir="%DNNL_PATH%" ^
-Dopencl_libdirs="%OPENCL_LIB_PATH%" -Dopencl_include="%OPENCL_INCLUDE_PATH%" ^
-Dopenblas_include="%OPENBLAS_PATH%\include" -Dopenblas_libdirs="%OPENBLAS_PATH%\lib" ^
-Ddefault_library=static -DUSE_SYCL=true -DUSE_L0_BACKEND=true -Db_vscrt=md -Db_lto=false

if errorlevel 1 exit /b

pause

cd build

ninja -d keeprsp

setlocal EnableDelayedExpansion
for /f "delims=" %%f in (lc0_sycl.exe.rsp) do set x=%%f
set x=!x:"/SUBSYSTEM:CONSOLE"=!
echo !x:/MACHINE:x64=! > lc0_sycl.exe.rsp

icx -fsycl -o lc0_sycl.exe @lc0_sycl.exe.rsp
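The loop near the end rewrites the ninja-generated response file before relinking with icx; a portable sketch of that substitution (hypothetical helper mirroring the batch hack, not part of the build):

```python
def strip_link_flags(rsp_text):
    # Drop the two MSVC-linker-only flags that the icx driver rejects when
    # it performs the final link (mirrors the batch-file substitution above).
    return rsp_text.replace('"/SUBSYSTEM:CONSOLE"', '').replace('/MACHINE:x64', '')
```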
borg323 commented 10 months ago

I just opened https://github.com/KateBlueSky/lc0/pull/1 with the suggested build system changes for integration in the regular lc0 builds and the above fixes for windows (including the linking hack). With this I was able to compare the output with the check backend against the blas code with satisfactory results:

C:\various\sycl>build\lc0 -w \various\791556.pb.gz -b check -o mode=display,freq=1.0,sycl,blas
       _
|   _ | |
|_ |_ |_| v0.31.0-dev+git.dirty built Nov  1 2023
go
Loading weights file from: \various\791556.pb.gz
Creating backend [check]...
Working backend set to blas.
Reference backend set to sycl.
Creating backend [blas]...
Detected 4 core(s) and 8 thread(s) in 1 group(s).
Group 0 has 4 core(s) and 8 thread(s).
BLAS functions from DNNL version 3.2.0
BLAS max batch size is 256.
Creating backend [sycl]...
GPU: Intel(R) Iris(R) Xe Graphics
GPU memory: 4294959104
GPU clock frequency: 1350
L2 cache capacity: 65536
Global memory Size: 6779285504
Using Fp32
WARNING: Low GPU video memory. You may run into OOM errors. Try using a smaller network.
1024 112
Check mode: error display.
Check rate: 100%.
1024 112
1024 112
1024 112
maximum error for a batch of 1:
  value: absolute: 6.6e-07, relative: 7.3e-06.
  policy: absolute: 2.7e-07, relative: 7.6e-06.
maximum error for a batch of 20:
  value: absolute: 3.9e-06, relative: 1.9e-04.
  policy: absolute: 2.9e-06, relative: 2.6e-05.
info depth 1 seldepth 2 time 32132 nodes 2 score cp 26 nps 1 tbhits 0 pv e2e4 e7e5
maximum error for a batch of 19:
  value: absolute: 3.9e-06, relative: 1.9e-04.
  policy: absolute: 2.9e-06, relative: 2.6e-05.
maximum error for a batch of 20:
  value: absolute: 3.0e-06, relative: 1.2e-05.
  policy: absolute: 4.5e-06, relative: 2.6e-05.
info depth 2 seldepth 3 time 32253 nodes 6 score cp 26 nps 5 tbhits 0 pv e2e4 e7e5 g1f3
maximum error for a batch of 22:
  value: absolute: 2.7e-06, relative: 1.4e-05.
  policy: absolute: 5.0e-06, relative: 3.2e-05.
maximum error for a batch of 8:
  value: absolute: 1.3e-06, relative: 4.1e-05.
  policy: absolute: 3.9e-06, relative: 2.1e-05.
info depth 2 seldepth 4 time 32597 nodes 10 score cp 25 nps 6 tbhits 0 pv e2e4 e7e5 g1f3 g8f6
maximum error for a batch of 9:
  value: absolute: 1.6e-06, relative: 4.1e-05.
  policy: absolute: 4.4e-06, relative: 2.8e-05.
maximum error for a batch of 12:
  value: absolute: 2.1e-06, relative: 2.2e-05.
  policy: absolute: 4.4e-06, relative: 2.3e-05.
info depth 3 seldepth 5 time 32692 nodes 18 score cp 26 nps 11 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4
maximum error for a batch of 29:
  value: absolute: 4.4e-06, relative: 2.2e-05.
  policy: absolute: 4.2e-06, relative: 5.1e-05.
maximum error for a batch of 20:
  value: absolute: 2.4e-06, relative: 3.1e-05.
  policy: absolute: 4.7e-06, relative: 3.5e-05.
info depth 3 seldepth 6 time 32844 nodes 30 score cp 27 nps 17 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4
maximum error for a batch of 10:
  value: absolute: 2.3e-06, relative: 3.1e-05.
  policy: absolute: 6.6e-06, relative: 2.4e-05.
maximum error for a batch of 14:
  value: absolute: 2.7e-06, relative: 3.7e-05.
  policy: absolute: 2.9e-06, relative: 4.1e-05.
info depth 3 seldepth 7 time 32979 nodes 40 score cp 26 nps 21 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3
maximum error for a batch of 15:
  value: absolute: 3.3e-06, relative: 3.9e-05.
  policy: absolute: 5.2e-06, relative: 4.9e-05.
info depth 3 seldepth 8 time 33094 nodes 50 score cp 27 nps 25 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5
maximum error for a batch of 19:
  value: absolute: 1.7e-06, relative: 6.5e-05.
  policy: absolute: 5.7e-06, relative: 5.1e-05.
maximum error for a batch of 20:
  value: absolute: 2.5e-06, relative: 3.5e-04.
  policy: absolute: 4.4e-06, relative: 4.1e-05.
info depth 4 seldepth 8 time 33283 nodes 74 score cp 29 nps 34 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5
maximum error for a batch of 27:
  value: absolute: 3.0e-06, relative: 7.0e-05.
  policy: absolute: 5.7e-06, relative: 3.2e-05.
info depth 4 seldepth 9 time 33306 nodes 97 score cp 28 nps 44 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5
maximum error for a batch of 16:
  value: absolute: 3.2e-06, relative: 3.7e-05.
  policy: absolute: 4.0e-06, relative: 3.4e-05.
maximum error for a batch of 34:
  value: absolute: 2.9e-06, relative: 7.1e-05.
  policy: absolute: 3.3e-06, relative: 5.4e-05.
info depth 4 seldepth 10 time 33503 nodes 173 score cp 27 nps 72 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f3e5 d7d6 e5f3
maximum error for a batch of 31:
  value: absolute: 2.8e-06, relative: 6.2e-04.
  policy: absolute: 5.8e-06, relative: 4.1e-05.
maximum error for a batch of 32:
  value: absolute: 2.7e-06, relative: 4.0e-05.
  policy: absolute: 5.4e-06, relative: 6.1e-05.
maximum error for a batch of 51:
  value: absolute: 4.6e-06, relative: 1.2e-04.
  policy: absolute: 5.4e-06, relative: 5.2e-05.
info depth 5 seldepth 11 time 34198 nodes 302 score cp 27 nps 98 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 b1c3
info depth 5 seldepth 11 time 34298 nodes 334 score cp 27 nps 105 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 b1c3 d7d5 g1f3 e7e6
maximum error for a batch of 29:
  value: absolute: 2.6e-06, relative: 5.1e-05.
  policy: absolute: 5.0e-06, relative: 5.6e-05.
maximum error for a batch of 31:
  value: absolute: 3.9e-06, relative: 7.3e-05.
  policy: absolute: 6.9e-06, relative: 6.5e-05.
info depth 5 seldepth 12 time 34729 nodes 376 score cp 29 nps 104 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 b1c3
maximum error for a batch of 84:
  value: absolute: 3.1e-06, relative: 3.2e-04.
  policy: absolute: 8.8e-06, relative: 5.1e-05.
info depth 5 seldepth 12 time 35184 nodes 520 score cp 27 nps 128 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 b1c3 d7d5 g1f3 e7e6 c1g5
maximum error for a batch of 25:
  value: absolute: 3.4e-06, relative: 6.6e-05.
  policy: absolute: 4.8e-06, relative: 4.7e-05.
info depth 5 seldepth 13 time 35189 nodes 529 score cp 29 nps 130 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 e5d7 c8d7
maximum error for a batch of 37:
  value: absolute: 3.5e-06, relative: 2.2e-04.
  policy: absolute: 5.4e-06, relative: 7.3e-05.
info depth 5 seldepth 14 time 35648 nodes 578 score cp 28 nps 127 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 e5d7 c8d7
maximum error for a batch of 29:
  value: absolute: 2.8e-06, relative: 7.9e-05.
  policy: absolute: 7.1e-06, relative: 4.9e-05.
info depth 5 seldepth 15 time 36176 nodes 615 score cp 28 nps 121 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 e5d7 c8d7
info depth 5 seldepth 15 time 36184 nodes 637 score cp 27 nps 125 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 b1c3 d7d5 g1f3 e7e6 e2e3 b8d7
maximum error for a batch of 34:
  value: absolute: 5.0e-06, relative: 3.6e-04.
  policy: absolute: 6.9e-06, relative: 4.3e-05.
info depth 5 seldepth 16 time 36491 nodes 671 score cp 27 nps 124 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 b1c3 d7d5 g1f3 e7e6 e2e3 b8d7
maximum error for a batch of 130:
  value: absolute: 6.7e-06, relative: 3.3e-04.
  policy: absolute: 7.3e-06, relative: 5.8e-05.
info depth 5 seldepth 16 time 36892 nodes 808 score cp 27 nps 140 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 b1c3 d7e5 d4e5
maximum error for a batch of 30:
  value: absolute: 5.4e-06, relative: 5.0e-04.
  policy: absolute: 8.3e-06, relative: 5.4e-05.
info depth 6 seldepth 16 time 36909 nodes 897 score cp 27 nps 154 tbhits 0 pv e2e4 e7e5 g1f3 g8f6 d2d4 f6e4 f1d3 d7d5 f3e5 b8d7 b1c3 d7e5 d4e5
maximum error for a batch of 76:
  value: absolute: 5.3e-06, relative: 6.2e-04.
  policy: absolute: 6.7e-06, relative: 4.7e-05.
info depth 6 seldepth 17 time 37500 nodes 1001 score cp 27 nps 156 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 g1f3 d7d5 b1c3 e7e6 e2e3 b8d7 d1c2 f8d6
maximum error for a batch of 65:
  value: absolute: 3.6e-06, relative: 2.8e-04.
  policy: absolute: 8.7e-06, relative: 8.7e-05.
info depth 6 seldepth 18 time 37897 nodes 1085 score cp 27 nps 160 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 g1f3 d7d5 b1c3 e7e6 e2e3 b8d7 d1c2 f8d6
maximum error for a batch of 39:
  value: absolute: 4.5e-06, relative: 1.4e-03.
  policy: absolute: 6.1e-06, relative: 3.7e-05.
info depth 6 seldepth 19 time 38141 nodes 1132 score cp 27 nps 161 tbhits 0 pv d2d4 g8f6 c2c4 c7c6 g1f3 d7d5 b1c3 e7e6 e2e3 b8d7 d1c2 f8d6
maximum error for a batch of 256:
  value: absolute: 5.2e-06, relative: 1.3e-02.
  policy: absolute: 9.1e-06, relative: 9.8e-05.
borg323 commented 10 months ago

BTW, I can't get sycl-fp16 to work, I get an exception, either Native API failed. Native API returns: -997 (Command failed to enqueue/execute) or Native API failed. Native API returns: -1 (PI_ERROR_DEVICE_NOT_FOUND) depending on which network file I try. Any ideas?

KateBlueSky commented 10 months ago

Hi @borg323 what kind of GPU are you using? I just want to confirm it supports fp16.

borg323 commented 10 months ago

It is a Tiger Lake iGPU. It does support fp16, I have successfully used it with onnx and onednn.

KateBlueSky commented 10 months ago

@borg323 did you use the SYCL compiler from the oneAPI Base Kit? For Intel GPUs you'll want to use the SYCL compiler that comes with the oneAPI Base Kit from the Intel website. For Nvidia and AMD you'll want to use the open-source SYCL compiler from intel/llvm. Sorry if this seems confusing, but the SYCL compiler in the oneAPI Base Kit includes the Level Zero (L0) backend runtime that is needed for Intel GPUs.

borg323 commented 10 months ago

Yes, the OneAPI base kit is the one I use, and it works quite well in fp32 mode - I seem to get better performance with sycl and fp32 than I get with onednn and fp16.

borg323 commented 10 months ago

I found that fp16 is crashing because of an unimplemented feature, and opened https://github.com/KateBlueSky/lc0/pull/2 with this and some other changes. This is not enough to fix fp16, but at least it doesn't crash any more, just returns many NaNs, which allows a more detailed investigation. Doing this I found the InputTransform kernel is not working for fp16, but haven't managed to get to the bottom of it. Any ideas?

borg323 commented 10 months ago

No progress with the fp16 issues, but at least I have a more stable solution for the build problems on windows: https://github.com/KateBlueSky/lc0/pull/3

borg323 commented 10 months ago

In the original iteration of this, there was support for USE_AMD in meson.build. Is this something worth considering as a separate backend?

KateBlueSky commented 10 months ago

In the original iteration of this, there was support for USE_AMD in meson.build. Is this something worth considering as a separate backend?

This is an interesting question. The SYCL code should be able to run on Nvidia, AMD, and Intel. The USE_AMD flag in the SYCL code was there to run SYCL on AMD and allow SYCL to make interop calls into the hipBLAS libraries for the sgemm operations. The flag used for Nvidia SYCL is USE_CUBLAS; the default is Intel. If you look at the code here https://github.com/KateBlueSky/lc0/blob/2dd9393de655eb02a5c75473aa108f0c7c40d09a/src/neural/sycl/layers.cc.dp.cpp#L585 you'll see how this is being done. In theory with SYCL you can write once and run anywhere, and it should be device independent, but in practice you need to write device-specific code when making calls to a device's native libraries. The other thing is, if you just want a pure AMD/HIP version without the SYCL code, you should take a look at the AMD HIPIFY tool and run the CUDA version through it. The HIPIFY tool does a pretty good job with the conversion.

On another note, I'm still looking at this fp16 problem and should have an update tomorrow for it. I think one of the buffers is messed up when making an sgemm call.

borg323 commented 10 months ago

On another note, I'm still looking at this fp16 problem and should have an update tomorrow for it. I think one of the buffers is messed up when making an sgemm call.

If you can, please join our discord chat http://lc0.org/chat so we can discuss this in detail (either publicly or using direct messages), I'm seeing issues a lot earlier than the first gemm call. The short version is I disabled everything in the fp16 pipeline and the first test was successful passing data to the shared buffers and back, then added a gemm call and that one seemed OK as well. The issues were traced to the first custom kernel called, namely InputTransform (calling just that one) returning NaNs (IIRC, I don't have my notes here).
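The bisection described above, enabling pipeline stages one at a time until NaNs appear, can be sketched generically (hypothetical stage functions, not the actual lc0 kernels):

```python
import math

def first_nan_stage(stages, x):
    # Run (name, fn) pipeline stages in order; return the name of the first
    # stage whose output contains a NaN, or None if the pipeline is clean.
    for name, fn in stages:
        x = fn(x)
        if any(math.isnan(v) for v in x):
            return name
    return None
```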

KateBlueSky commented 10 months ago

Canceling this pull request for now. It needs more testing and fixes, and I don't have the time to dedicate to it right now. I might revisit the SYCL code contribution in the future with fixes.

borg323 commented 10 months ago

Since it works quite well for fp32, I would keep it open as a draft PR for the time being.

borg323 commented 7 months ago

Might revisit the SYCL code contribution in the future with fixes.

Is there any progress here? If not, would you mind if I upstream a subset of this?

KateBlueSky commented 7 months ago

Hi, no, I haven't made any progress on it. I haven't tested or worked on the code since this PR. I'm not sure what you mean by "upstream", but you're more than welcome to use any of the code that was committed in this PR if you find it useful.