Open chsasank opened 7 months ago
Rebuilt with dpcpp for the sake of sanity:
$ echo $CC $CXX
/opt/intel/oneapi/2024.0/bin/compiler/clang /opt/intel/oneapi/2024.0/bin/compiler/clang++
$ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_ARCH=sm_75 -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DTUNING_TARGET=NVIDIA_GPU -DCMAKE_BUILD_TYPE=Release
$ ninja
Run the benchmark:
# For all GPUs
cat << EOF > params.csv
n,n,1024,1024,1024,1,0
n,n,2048,2048,2048,1,0
n,n,4096,4096,4096,1,0
EOF
./benchmark/portblas/bench_gemm --csv-param params.csv --benchmark_out=../results.json \
--benchmark_out_format=json --benchmark_format=console
Output:
Device vendor: NVIDIA Corporation
Device name: NVIDIA GeForce GTX 1650 Ti
Device type: gpu
2024-03-25T17:27:34+05:30
Running ./benchmark/portblas/bench_gemm
Run on (12 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 512 KiB (x6)
L3 Unified 4096 KiB (x2)
Load Average: 1.76, 4.02, 6.45
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_Gemm<float>/n/n/1024/1024/1024/buffer/real_time 1727943 ns 1721562 ns 366 avg_event_time=1.70028M avg_overall_time=1.72208M batch_size=1 best_event_time=1.4602M best_overall_time=1.48428M beta=0 bytes_per_second=6.78191G/s bytes_processed=12.5829M items_per_second=1.2434T/s k=1024 m=1024 n=1024 n_fl_ops=2.14853G total_event_time=622.301M total_overall_time=630.281M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/2048/2048/2048/buffer/real_time 7738029 ns 7737817 ns 86 avg_event_time=7.71147M avg_overall_time=7.73214M batch_size=1 best_event_time=7.52393M best_overall_time=7.54481M beta=0 bytes_per_second=6.05774G/s bytes_processed=50.3316M items_per_second=2.22073T/s k=2.048k m=2.048k n=2.048k n_fl_ops=17.1841G total_event_time=663.187M total_overall_time=664.964M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/4096/4096/4096/buffer/real_time 57458137 ns 57418006 ns 12 avg_event_time=57.4272M avg_overall_time=57.4517M batch_size=1 best_event_time=56.2959M best_overall_time=56.3316M beta=0 bytes_per_second=3.26325G/s bytes_processed=201.327M items_per_second=2.39228T/s k=4.096k m=4.096k n=4.096k n_fl_ops=137.456G total_event_time=689.126M total_overall_time=689.42M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/1024/1024/1024/usm/real_time 1732308 ns 1730963 ns 404 avg_event_time=1.71372M avg_overall_time=1.72614M batch_size=1 best_event_time=1.45508M best_overall_time=1.46836M beta=0 bytes_per_second=6.76482G/s bytes_processed=12.5829M items_per_second=1.24027T/s k=1024 m=1024 n=1024 n_fl_ops=2.14853G total_event_time=692.342M total_overall_time=697.36M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 7743868 ns 7740781 ns 88 avg_event_time=7.72484M avg_overall_time=7.73773M batch_size=1 best_event_time=7.55957M best_overall_time=7.57233M beta=0 bytes_per_second=6.05318G/s bytes_processed=50.3316M items_per_second=2.21905T/s k=2.048k m=2.048k n=2.048k n_fl_ops=17.1841G total_event_time=679.786M total_overall_time=680.92M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/4096/4096/4096/usm/real_time 58875445 ns 58867885 ns 12 avg_event_time=58.8555M avg_overall_time=58.8689M batch_size=1 best_event_time=56.9023M best_overall_time=56.9161M beta=0 bytes_per_second=3.18469G/s bytes_processed=201.327M items_per_second=2.33469T/s k=4.096k m=4.096k n=4.096k n_fl_ops=137.456G total_event_time=706.266M total_overall_time=706.426M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
Summary:
test_name, gflops
BM_Gemm<float>/n/n/1024/1024/1024/buffer/real_time 1243 1243
BM_Gemm<float>/n/n/2048/2048/2048/buffer/real_time 2220 2221
BM_Gemm<float>/n/n/4096/4096/4096/buffer/real_time 2392 2392
BM_Gemm<float>/n/n/1024/1024/1024/usm/real_time 1240 1240
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 2219 2219
BM_Gemm<float>/n/n/4096/4096/4096/usm/real_time 2334 2335
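For reference, these gflops figures are consistent with the counters in the raw output above: the 4096 buffer case reports n_fl_ops=137.456G at about 57.46 ms per iteration, i.e. 137.456 / 0.05746 ≈ 2392 GFLOP/s, which matches the items_per_second=2.39228T/s counter.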
This is quite similar to the above. Something is slowing down acpp.
It's an Intel/Codeplay library. Obviously the focus of optimization and validation was on DPC++.
To their credit, at least they tried to make it work, which cannot be said of all of the oneAPI SYCL libraries. Those don't even try to support anything but DPC++.
There seem to be a couple of AdaptiveCpp-specific code paths in portBLAS, so the executed code won't be the same: https://github.com/search?q=repo%3Acodeplaysoftware%2FportBLAS+__ADAPTIVECPP__&type=code
I don't know why exactly they are needed. Some seem to exist to work around the fact that the generic SSCP compiler does not yet implement the SYCL 2020 group algorithms library.
If there are no other issues and it turns out that it is bound by group algorithm performance, we could close this issue, as it's a known limitation and on the todo list.
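To illustrate what such a conditional path might look like, here is a hypothetical sketch (not the actual portBLAS code): under __ADAPTIVECPP__ a manual local-memory reduction is used, otherwise the SYCL 2020 group algorithm.

#include <sycl/sycl.hpp>

// Hypothetical work-group sum with an AdaptiveCpp-specific fallback.
// Assumes a power-of-two work-group size; scratch is a local accessor
// with at least work-group-size elements.
template <typename T>
T work_group_sum(sycl::nd_item<1> it, T value,
                 const sycl::local_accessor<T, 1> &scratch) {
#ifdef __ADAPTIVECPP__
  // Manual tree reduction in local memory, avoiding the SYCL 2020 group
  // algorithms that the generic SSCP compiler does not yet implement.
  const size_t lid = it.get_local_id(0);
  scratch[lid] = value;
  for (size_t stride = it.get_local_range(0) / 2; stride > 0; stride /= 2) {
    sycl::group_barrier(it.get_group());
    if (lid < stride) scratch[lid] += scratch[lid + stride];
  }
  sycl::group_barrier(it.get_group());
  return scratch[0];
#else
  // SYCL 2020 group algorithm.
  return sycl::reduce_over_group(it.get_group(), value, sycl::plus<T>());
#endif
}

Called from an nd_range kernel, both branches compute the same work-group sum, but the fallback is extra code to maintain and may well be slower than a tuned group algorithm.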
> There seem to be a couple of AdaptiveCpp-specific code paths in portBLAS, so the executed code won't be the same: https://github.com/search?q=repo%3Acodeplaysoftware%2FportBLAS+__ADAPTIVECPP__&type=code
I didn't realize this and assumed it was the same code.
> Some seem to exist to work around the fact that the generic SSCP compiler does not yet implement the SYCL 2020 group algorithms library.
Is there a checklist of the things that are implemented and that are not?
> Is there a checklist of the things that are implemented and that are not?
Compared to the older compilation flows, it's really only the SYCL 2020 group algorithms library and SYCL 2020 reductions, the latter of which are also not fully implemented in the old SMCP compilers.
Plus some less important features: the scoped parallelism extension, and the hierarchical parallelism model, which was explicitly discouraged in the SYCL 2020 spec and is likely to be removed in future SYCL versions.
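For concreteness, "SYCL 2020 reductions" refers to the sycl::reduction interface, e.g. (minimal sketch, unrelated to portBLAS):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int *sum = sycl::malloc_shared<int>(1, q);
  *sum = 0;
  // SYCL 2020 reduction object: accumulates into *sum across all work items.
  q.parallel_for(sycl::range<1>{1024},
                 sycl::reduction(sum, sycl::plus<int>()),
                 [=](sycl::id<1> i, auto &acc) { acc += static_cast<int>(i[0]); })
      .wait();
  // *sum is now 0 + 1 + ... + 1023 = 523776
  sycl::free(sum, q);
}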
On the other hand, the SSCP compiler supports functionality that the old compilers do not implement, such as SYCL_EXTERNAL.
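A minimal sketch of what SYCL_EXTERNAL enables (hypothetical file and function names): calling a device function defined in a separate translation unit.

// helper.cpp (hypothetical), compiled separately
#include <sycl/sycl.hpp>
SYCL_EXTERNAL float scale(float x) { return 2.0f * x; }

// main.cpp (hypothetical)
#include <sycl/sycl.hpp>
SYCL_EXTERNAL float scale(float x);  // definition lives in helper.cpp

int main() {
  sycl::queue q;
  float *v = sycl::malloc_shared<float>(16, q);
  for (int i = 0; i < 16; ++i) v[i] = static_cast<float>(i);
  // The kernel calls scale(), whose device code comes from another TU.
  q.parallel_for(sycl::range<1>{16},
                 [=](sycl::id<1> i) { v[i] = scale(v[i]); })
      .wait();
  sycl::free(v, q);
}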
Compared to DPC++... this is actually a contentious issue. There is no consensus between implementations about which SYCL 2020 features are actually portable and implementable across implementations. DPC++ implements some SYCL 2020 functionality that was merged without any prior implementation experience and only makes sense for DPC++.
Hi,
I've just built adaptivecpp for an Nvidia GPU, then built PortBLAS and compared the benchmarks to dpcpp.
Install dependencies for PortBLAS:
Build portBLAS with acpp:
To benchmark, I ran:
This would yield the following results:
To summarize this:
Data from my previous run with dpc++:
References: https://chsasank.com/portblas-portable-blas-across-gpus.html