clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0

sgemm with 15548cf is 30X slower than with libclBLAS.so.2.3 #239

Closed matejaputic closed 8 years ago

matejaputic commented 8 years ago

Using workload size M=N=K=8192, clblasSgemm is about 30X slower with libclBLAS.so.2.11 than with libclBLAS.so.2.3.

To recreate, apply my patch to src/samples/example_sgemm.c from https://gist.github.com/matejaputic/6fcf4cf0ee872c8c57e8

The only modifications are that the workload size is 8192x8192x8192 instead of 3x4x5, and that memory for the buffers is allocated on the heap.

I profiled the GPU counters with the sprofile command from AMD's CodeXL on an AMD FirePro W9100 card. The outputs for both libclBLAS.so.2.3 and libclBLAS.so.2.11 are here: https://gist.github.com/matejaputic/6437e74ac7064e12aa77

The 2.11 version dispatches four runs of the same kernel: one runs for about 15 seconds, and the others run for between 3 and 20 ms each. The 2.3 version dispatches a single kernel, which runs in about 535 ms.

Please let me know if I can provide additional info. I realize this is only a single datapoint, but I thought it significant enough to bring it to the attention of the dev team.

Am I doing anything wrong here?

tingxingdong commented 8 years ago

We did observe before that SGEMM performance declines sharply at particular sizes: 2048, 4096, and 8192.

The four-kernel approach was introduced to "mitigate" the decline. Timmy can tell you more, but you can test other sizes, like 8000, to see if it is still slow.

TimmyLiu commented 8 years ago

Another thing to check: make sure you have selected the OpenCL 2.0 compiler in CMake. You want to set OPENCL_VERSION to 2.0. By default the 1.2 compiler is selected, and it uses a lot more registers.

On a second note: your VGPR count for the first kernel is 228. I think it should be less than 64 if the right compiler is used.

matejaputic commented 8 years ago

Thanks, Timmy. I am going to have to look into this further.

With OpenCL 2.0, the same benchmark segfaults in clBuildProgram when called from makeGemmKernel. I will follow up on this. In the meantime, here is the stack trace:

https://gist.github.com/matejaputic/5b7509ce564befd70454

guacamoleo commented 8 years ago

No more complaints in over a week. I'll close it out assuming it was that compiler issue.