clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0

AutoGEMM problem(s) on NVIDIA Hardware #220

Closed naumb closed 8 years ago

naumb commented 8 years ago

I'm trying to compare the clBLAS library with cuBLAS on an Nvidia K20c. Building clBLAS works fine so far; measurements with the included client come out at about 610 GFLOPS (for dgemm, with m, n and k all 4096 in this case), which is about 55% of peak performance. I've been trying to improve this result by using the AutoGEMM profiler, as described in the wiki, to select the fastest kernels for this GPU. After getting it to work on NVIDIA OpenCL platforms (by changing the hard-coded PLATFORM_NAME in the profiler code) it seemed to run correctly, but I am not sure the output file is complete, because it seems to be missing some data. The following lines are taken directly from the file prof_sgemm_ksr.txt generated by the Autogemm_Tools_Profile executable:

[  384, [ 48, 48], [ [ 64, 64], [ 48, 48] ] ],
[  400, [ 32, 32], [ [ 64, 64], [ 48, 48], [ 80, 80] ] ],
[  416, [ 32, 32], [ [ 64, 64], [ 48, 48], [ 80, 80], [ 32, 32] ] ],
[  464, [ 80, 80], [ [ 64, 64], [ 48, 48], [ 80, 80], [ 32, 32] ] ],

According to the wiki this should be the kernel selection data (for sgemm in this case), but compared to the included kernelSelectionData in AutoGemmParameters.py, there are 2 elements missing in each inner list (taken from kernelSelectionDataHawaii):

[ 4000, [ 16, 16,  6,  6], [ [ 16, 16,  6,  6] ] ],
[ 2496, [ 16, 16,  4,  4], [ [ 16, 16,  6,  6], [ 16, 16,  4,  4] ] ],
[ 2448, [ 16, 16,  6,  6], [ [ 16, 16,  6,  6] ] ],

Correct me if I'm wrong, but I think the values for the microtile dimensions are missing. I am not sure whether this is related to the fact that I'm using NVIDIA hardware, but I don't know what to do with the generated data at this point. I can't include this data in the AutoGemmParameters.py script, because it references the missing elements.

tl;dr: Is it possible to use AutoGEMM on NVIDIA Hardware or am I "stuck" with the default kernels?

Edit: I just realised I forgot to mention: I'm using a CentOS 7.1 system, in case that is relevant to this problem.

tingxingdong commented 8 years ago

Hi, Billy:

I think I can summarize what you are doing in a few words:

        Tuning AutoGEMM on an Nvidia card.

Regarding the two missing elements: you want to change them like this:

[ 384, [ 48, 48], [ [ 64, 64], [ 48, 48] ] ], --> [ 384, [ 16, 16, 3, 3 ], [ [ 16, 16, 4, 4], [ 16, 16, 3, 3] ] ],
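The suggested rewrite implies a fixed mapping: a 16x16 workgroup, with each microtile dimension obtained by dividing the macrotile dimension by 16 (48/16 = 3, 64/16 = 4). Assuming that mapping holds for every profiler entry (an assumption, not something stated explicitly in this thread), the conversion can be sketched as a small script rather than done by hand:

```python
# Hypothetical converter for the profiler's two-element tile entries.
# Assumption: every kernel uses a 16x16 workgroup, so each microtile
# dimension is macrotile // 16. Verify this against your own kernels.

WG = 16  # assumed workgroup dimension

def expand_tile(macrotile):
    """Expand a macrotile pair, e.g. [48, 48] -> [16, 16, 3, 3]."""
    return [WG, WG, macrotile[0] // WG, macrotile[1] // WG]

def expand_entry(entry):
    """Rewrite [size, fastest, candidates] into the four-element format."""
    size, fastest, candidates = entry
    return [size, expand_tile(fastest), [expand_tile(c) for c in candidates]]

entry = [384, [48, 48], [[64, 64], [48, 48]]]
print(expand_entry(entry))
# -> [384, [16, 16, 3, 3], [[16, 16, 4, 4], [16, 16, 3, 3]]]
```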


naumb commented 8 years ago

Thank you for your reply,

Tuning AutoGEMM on Nvidia card.

I guess that pretty much sums it up :+1:

After changing the lists as you described, the AutoGemmParameters script runs fine so far - everything builds without errors. Unfortunately, the new kernels don't bring the improvement I was hoping for. Instead, sgemm performance drops from about ~1 TFLOPS down to ~200 GFLOPS (m, n and k are 4096 in this test, too).

Are there any other assumptions made within the profiler-code that could interfere with my Nvidia hardware?

tingxingdong commented 8 years ago

When tuning GEMM you should get two files: prof_sgemm_ksr.txt and prof_sgemm_raw.csv.

Check the GFLOP/s numbers for the 4096 case in the .csv file to see whether they are really around 200.


naumb commented 8 years ago

There are multiple numbers in this file, I assume one for each tile size and unroll value. The line itself looks like this:

4096, 4096, 59.8307, 220.849, 240.196, 212.161, 461.437, 439.214, 334.463, 577.598, 590.172, 463.61, 741.16, 790.162, 556.23, 813.633, 941.988, 533.845, 1034.45, 1044.49, <-F|T->, 60.2822, 222.991, 240.973, 217.049, 466.821, 441.828, 0, 0, 0, 474.877, 760.203, 809.713, 0, 0, 0, 0, 0, 0, 96x96,

I've attached both generated output files for comparison. For larger tile sizes and unroll values the GFLOPS value seems to be mostly increasing, even to a level comparable to the "original" input.
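As a rough sketch of how such a row might be read programmatically (the layout is inferred from the sample row above: m, n, a run of GFLOP/s values with a "<-F|T->" separator, and a trailing tile label), the non-numeric fields can simply be skipped when looking for the best value:

```python
# Assumed prof_sgemm_raw.csv row layout: m, n, GFLOP/s values for each
# tile/unroll combination, a "<-F|T->" separator, more values, and the
# winning tile label (e.g. "96x96"). Non-numeric fields are skipped.

row = ("4096, 4096, 59.8307, 220.849, 240.196, 212.161, 461.437, 439.214, "
       "334.463, 577.598, 590.172, 463.61, 741.16, 790.162, 556.23, 813.633, "
       "941.988, 533.845, 1034.45, 1044.49, <-F|T->, 60.2822, 222.991, "
       "240.973, 217.049, 466.821, 441.828, 0, 0, 0, 474.877, 760.203, "
       "809.713, 0, 0, 0, 0, 0, 0, 96x96,")

fields = [f.strip() for f in row.split(",") if f.strip()]
m, n = fields[0], fields[1]
gflops = []
for f in fields[2:]:
    try:
        gflops.append(float(f))
    except ValueError:
        pass  # skip the "<-F|T->" separator and the "96x96" tile label
print(m, n, max(gflops))  # -> 4096 4096 1044.49
```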

prof_sgemm_ksr.txt prof_sgemm_raw.csv.txt

Edit: I've solved this problem. I had noticed this a few days ago but thought it wouldn't matter: it is necessary to sort the kernelSelectionData by size in descending order. I'm now achieving around 1.05 TFLOPS in this test case, which is far better than the ~200 GFLOPS, but compared to the default results still no improvement.
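A minimal sketch of that sorting fix: AutoGemmParameters.py apparently expects kernelSelectionData ordered by matrix size in descending order, while the profiler emits it ascending. The entries below are illustrative placeholders, not real tuning data:

```python
# Sort profiler-generated kernelSelectionData by size, descending,
# before pasting it into AutoGemmParameters.py. Tile values here are
# made up for illustration only.

kernel_selection_data = [
    [384, [16, 16, 3, 3], [[16, 16, 4, 4], [16, 16, 3, 3]]],
    [464, [16, 16, 5, 5], [[16, 16, 4, 4], [16, 16, 5, 5]]],
    [416, [16, 16, 2, 2], [[16, 16, 4, 4], [16, 16, 2, 2]]],
]
kernel_selection_data.sort(key=lambda entry: entry[0], reverse=True)
print([entry[0] for entry in kernel_selection_data])  # -> [464, 416, 384]
```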

tingxingdong commented 8 years ago

Yes, the numbers here are GFLOP/s, so the largest one is 1044.49 at 96x96, which means you should be able to obtain 1044.49 once you have finalized your settings.

Still, I encourage you to try matrix sizes other than 1024, 2048, and 4096. These evenly divisible sizes are handled separately in the clBLAS GEMM on AMD GPUs; it might be different on Nvidia GPUs.

Don't check one particular size, plot a line.
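To follow that advice, one could extract the best GFLOP/s for each matrix size from prof_sgemm_raw.csv and plot the result over the whole size range instead of a single point. This is only a sketch: the CSV layout is assumed from the sample row earlier in the thread, and square matrices (m == n) are assumed:

```python
# Hedged sketch: best GFLOP/s per matrix size from prof_sgemm_raw.csv
# text, assuming each row starts with m, n and that m == n. Non-numeric
# fields (the "<-F|T->" separator, tile labels) are skipped.

def best_per_size(csv_text):
    results = {}
    for line in csv_text.splitlines():
        fields = [f.strip() for f in line.split(",") if f.strip()]
        if len(fields) < 3:
            continue
        size = int(fields[0])  # assuming square matrices
        vals = []
        for f in fields[2:]:
            try:
                vals.append(float(f))
            except ValueError:
                pass  # skip separators and tile labels
        results[size] = max(vals)
    return results  # size -> best GFLOP/s, ready to plot as a line

sample = "1024, 1024, 500.0, 612.5, 96x96,\n2048, 2048, 700.1, 688.0, 96x96,"
print(best_per_size(sample))  # -> {1024: 612.5, 2048: 700.1}
```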


tingxingdong commented 4 years ago

https://github.com/tingxingdong/clBLAS-private/wiki/How-to-tune-clBLAS-GEMM