Open chsasank opened 7 months ago
Rebuilt with dpcpp for the sake of sanity:
$ echo $CC $CXX
/opt/intel/oneapi/2024.0/bin/compiler/clang /opt/intel/oneapi/2024.0/bin/compiler/clang++
$ cmake -GNinja ../ -DSYCL_COMPILER=dpcpp -DDPCPP_SYCL_ARCH=sm_75 -DDPCPP_SYCL_TARGET=nvptx64-nvidia-cuda -DTUNING_TARGET=NVIDIA_GPU -DCMAKE_BUILD_TYPE=Release
$ ninja
Run the benchmark:
# For all GPUs
cat << EOF > params.csv
n,n,1024,1024,1024,1,0
n,n,2048,2048,2048,1,0
n,n,4096,4096,4096,1,0
EOF
./benchmark/portblas/bench_gemm --csv-param params.csv --benchmark_out=../results.json \
--benchmark_out_format=json --benchmark_format=console
Output:
Device vendor: NVIDIA Corporation
Device name: NVIDIA GeForce GTX 1650 Ti
Device type: gpu
2024-03-25T17:27:34+05:30
Running ./benchmark/portblas/bench_gemm
Run on (12 X 3000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 512 KiB (x6)
L3 Unified 4096 KiB (x2)
Load Average: 1.76, 4.02, 6.45
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_Gemm<float>/n/n/1024/1024/1024/buffer/real_time 1727943 ns 1721562 ns 366 avg_event_time=1.70028M avg_overall_time=1.72208M batch_size=1 best_event_time=1.4602M best_overall_time=1.48428M beta=0 bytes_per_second=6.78191G/s bytes_processed=12.5829M items_per_second=1.2434T/s k=1024 m=1024 n=1024 n_fl_ops=2.14853G total_event_time=622.301M total_overall_time=630.281M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/2048/2048/2048/buffer/real_time 7738029 ns 7737817 ns 86 avg_event_time=7.71147M avg_overall_time=7.73214M batch_size=1 best_event_time=7.52393M best_overall_time=7.54481M beta=0 bytes_per_second=6.05774G/s bytes_processed=50.3316M items_per_second=2.22073T/s k=2.048k m=2.048k n=2.048k n_fl_ops=17.1841G total_event_time=663.187M total_overall_time=664.964M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/4096/4096/4096/buffer/real_time 57458137 ns 57418006 ns 12 avg_event_time=57.4272M avg_overall_time=57.4517M batch_size=1 best_event_time=56.2959M best_overall_time=56.3316M beta=0 bytes_per_second=3.26325G/s bytes_processed=201.327M items_per_second=2.39228T/s k=4.096k m=4.096k n=4.096k n_fl_ops=137.456G total_event_time=689.126M total_overall_time=689.42M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/1024/1024/1024/usm/real_time 1732308 ns 1730963 ns 404 avg_event_time=1.71372M avg_overall_time=1.72614M batch_size=1 best_event_time=1.45508M best_overall_time=1.46836M beta=0 bytes_per_second=6.76482G/s bytes_processed=12.5829M items_per_second=1.24027T/s k=1024 m=1024 n=1024 n_fl_ops=2.14853G total_event_time=692.342M total_overall_time=697.36M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 7743868 ns 7740781 ns 88 avg_event_time=7.72484M avg_overall_time=7.73773M batch_size=1 best_event_time=7.55957M best_overall_time=7.57233M beta=0 bytes_per_second=6.05318G/s bytes_processed=50.3316M items_per_second=2.21905T/s k=2.048k m=2.048k n=2.048k n_fl_ops=17.1841G total_event_time=679.786M total_overall_time=680.92M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
BM_Gemm<float>/n/n/4096/4096/4096/usm/real_time 58875445 ns 58867885 ns 12 avg_event_time=58.8555M avg_overall_time=58.8689M batch_size=1 best_event_time=56.9023M best_overall_time=56.9161M beta=0 bytes_per_second=3.18469G/s bytes_processed=201.327M items_per_second=2.33469T/s k=4.096k m=4.096k n=4.096k n_fl_ops=137.456G total_event_time=706.266M total_overall_time=706.426M @backend=portBLAS,@datatype=float,@library=portBLAS,device_name=NVIDIA GeForce GTX 1650 Ti,device_version=7.5,driver_version=CUDA 12.3,git_hash=eff2458042246830fe35feec38c240a86a282d0a,git_hash_date=2024-03-04 13:40:34 +0000,vendor_name=NVIDIA Corporation
Summary:
test_name, gflops
BM_Gemm<float>/n/n/1024/1024/1024/buffer/real_time 1243 1243
BM_Gemm<float>/n/n/2048/2048/2048/buffer/real_time 2220 2221
BM_Gemm<float>/n/n/4096/4096/4096/buffer/real_time 2392 2392
BM_Gemm<float>/n/n/1024/1024/1024/usm/real_time 1240 1240
BM_Gemm<float>/n/n/2048/2048/2048/usm/real_time 2219 2219
BM_Gemm<float>/n/n/4096/4096/4096/usm/real_time 2334 2335
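For reference, these gflops figures are consistent with the counters in the raw output above: the 4096 buffer case reports n_fl_ops=137.456G at about 57.46 ms per iteration, i.e. 137.456 / 0.05746 ≈ 2392 GFLOP/s, which matches the items_per_second=2.39228T/s counter.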
This is quite similar to the above. Something is slowing down acpp.
It's an Intel/Codeplay library. Obviously the focus of optimization and validation was on DPC++.
To their credit, at least they tried to make it work, which cannot be said of all of the oneAPI SYCL libraries. Those don't even try to support anything but DPC++.
There seem to be a couple of AdaptiveCpp-specific code paths in portBLAS, so the executed code won't be the same: https://github.com/search?q=repo%3Acodeplaysoftware%2FportBLAS+__ADAPTIVECPP__&type=code
I don't know why exactly they are needed. Some seem to exist to work around the fact that the generic SSCP compiler does not yet implement the SYCL 2020 group algorithms library.
If there are no other issues and it turns out that it is bound by group algorithm performance, we could close this issue, as it's a known limitation and on the todo list.
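To illustrate what such a conditional path might look like, here is a hypothetical sketch (not the actual portBLAS code): under __ADAPTIVECPP__ a manual local-memory reduction is used, otherwise the SYCL 2020 group algorithm.

#include <sycl/sycl.hpp>

// Hypothetical work-group sum with an AdaptiveCpp-specific fallback.
// Assumes a power-of-two work-group size; scratch is a local accessor
// with at least work-group-size elements.
template <typename T>
T work_group_sum(sycl::nd_item<1> it, T value,
                 const sycl::local_accessor<T, 1> &scratch) {
#ifdef __ADAPTIVECPP__
  // Manual tree reduction in local memory, avoiding the SYCL 2020 group
  // algorithms that the generic SSCP compiler does not yet implement.
  const size_t lid = it.get_local_id(0);
  scratch[lid] = value;
  for (size_t stride = it.get_local_range(0) / 2; stride > 0; stride /= 2) {
    sycl::group_barrier(it.get_group());
    if (lid < stride) scratch[lid] += scratch[lid + stride];
  }
  sycl::group_barrier(it.get_group());
  return scratch[0];
#else
  // SYCL 2020 group algorithm.
  return sycl::reduce_over_group(it.get_group(), value, sycl::plus<T>());
#endif
}

Called from an nd_range kernel, both branches compute the same work-group sum, but the fallback is extra code to maintain and may well be slower than a tuned group algorithm.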
> There seem to be a couple of AdaptiveCpp-specific code paths in portBLAS, so the executed code won't be the same: https://github.com/search?q=repo%3Acodeplaysoftware%2FportBLAS+__ADAPTIVECPP__&type=code
I didn't realize this and assumed it was the same code.
> Some seem to exist to work around the fact that the generic SSCP compiler does not yet implement the SYCL 2020 group algorithms library.
Is there a checklist of the things that are implemented and that are not?
> Is there a checklist of the things that are implemented and that are not?
Compared to the older compilation flows, it's really only the SYCL 2020 group algorithms library and SYCL 2020 reductions, the latter of which are also not fully implemented in the old SMCP compilers.
Plus some less important features: the scoped parallelism extension, and the hierarchical parallelism model, which was explicitly discouraged in the SYCL 2020 spec and is likely to be removed in future SYCL versions.
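For concreteness, "SYCL 2020 reductions" refers to the sycl::reduction interface, e.g. (minimal sketch, unrelated to portBLAS):

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  int *sum = sycl::malloc_shared<int>(1, q);
  *sum = 0;
  // SYCL 2020 reduction object: accumulates into *sum across all work items.
  q.parallel_for(sycl::range<1>{1024},
                 sycl::reduction(sum, sycl::plus<int>()),
                 [=](sycl::id<1> i, auto &acc) { acc += static_cast<int>(i[0]); })
      .wait();
  // *sum is now 0 + 1 + ... + 1023 = 523776
  sycl::free(sum, q);
}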
On the other hand, the SSCP compiler supports functionality that the old compilers do not implement, such as SYCL_EXTERNAL.
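A minimal sketch of what SYCL_EXTERNAL enables (hypothetical file and function names): calling a device function defined in a separate translation unit.

// helper.cpp (hypothetical), compiled separately
#include <sycl/sycl.hpp>
SYCL_EXTERNAL float scale(float x) { return 2.0f * x; }

// main.cpp (hypothetical)
#include <sycl/sycl.hpp>
SYCL_EXTERNAL float scale(float x);  // definition lives in helper.cpp

int main() {
  sycl::queue q;
  float *v = sycl::malloc_shared<float>(16, q);
  for (int i = 0; i < 16; ++i) v[i] = static_cast<float>(i);
  // The kernel calls scale(), whose device code comes from another TU.
  q.parallel_for(sycl::range<1>{16},
                 [=](sycl::id<1> i) { v[i] = scale(v[i]); })
      .wait();
  sycl::free(v, q);
}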
Compared to DPC++... this is actually a contentious issue. There is no consensus between implementations about which SYCL 2020 features are actually portable and implementable across implementations. DPC++ implements some SYCL 2020 functionality that was merged without any prior implementation experience and only makes sense for DPC++.
Hi,
I've just built adaptivecpp for an Nvidia GPU, then built PortBLAS and compared the benchmarks to dpcpp.
Install dependencies for PortBLAS:
Build portBLAS with acpp:
To benchmark, I ran:
This would yield the following results:
To summarize this:
Data from my previous run with dpc++:
References: https://chsasank.com/portblas-portable-blas-across-gpus.html