CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

tuner transpose fails on various specific sizes #532

Open baryluk opened 9 months ago

baryluk commented 9 months ago

CLBlast-1.6.2-linux-x86_64

LD_LIBRARY_PATH=./lib ./bin/clblast_tuner_transpose_fast --platform 1 -m 2 -n 16

|   ID | total |               param |      local      |      global     |       compiles |         time |   GB/s |            status |
x------x-------x---------------------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|  ref |     - |                   - |       8       8 |       8      16 |             OK |      0.03 ms |      - |      reference OK |
x------x-------x---------------------x-----------------x-----------------x----------------x--------------x--------x-------------------x
|    1 |    52 |    4    1    0    0 |       4       4 |       4      16 |   OK     84 ms |      0.03 ms |      - | L2 error 4.30e-01 | <-- skipping
|    2 |    52 |    4    1    0    1 |       4       4 |       4      16 |   OK     81 ms |      0.03 ms |      - | L2 error 1.20e+00 | <-- skipping
|    3 |    52 |    4    1    1    0 |       4       4 |       4      16 |   OK     83 ms |      0.03 ms |      - | L2 error 4.30e-01 | <-- skipping
|    4 |    52 |    4    1    1    1 |       4       4 |       4      16 |   OK     95 ms |      0.03 ms |      - | L2 error 1.20e+00 | <-- skipping
|    5 |    52 |    4    2    0    0 |       4       4 |       4       8 |   OK     95 ms |      0.03 ms |      - | L2 error 6.83e-01 | <-- skipping
|    6 |    52 |    4    2    0    1 |       4       4 |       4       8 |   OK     95 ms |      0.03 ms |      - | L2 error 1.61e+00 | <-- skipping
|    7 |    52 |    4    2    1    0 |       4       4 |       4       8 |   OK     91 ms |      0.03 ms |      - | L2 error 6.83e-01 | <-- skipping
^C

Examples of sizes that fail:

m==2, (n==4 || n>=16)
m==3, (n==15 || n==16 || n>=25)
m==4, (n==9||n>=10)
m==5, (n==1 || n==8 || n==9 || n==12 || n>=20)
m==9, (n==8 || n==10 || n==15 || 20<=n<=48 || n>=64)
m==10, (20<=n<=32 || n>=50)
m==12, (20<=n<=50 || n>=92)

#!/bin/bash

for m in 1 2 3 4 5 8 9 10 12 15 16 20 25 30 32 48 50 64 92 100 128 156 200 256 300 384 400 500 512 1000 1024 1200 1600 2000 2048 3000 4000 4096 5000 8192; do
  for n in 1 2 3 4 5 8 9 10 12 15 16 20 25 30 32 48 50 64 92 100 128 156 200 256 300 384 400 500 512 1000 1024 1200 1600 2000 2048 3000 4000 4096 5000 8192; do
    echo "m: $m n: $n" "$(LD_LIBRARY_PATH=./lib ./bin/clblast_tuner_transpose_fast --platform 1 -runs 1 -m "$m" -n "$n" | grep -E 'Best parameters:')"
  done
done

CNugteren commented 9 months ago

That is expected behaviour. The tuner simply runs a specific kernel, and certain kernels have constraints that also depend on the tuning parameters. That's why those cases are skipped.

Furthermore, it is probably not a good idea to tune for these tiny input sizes, because what you'll mainly measure is kernel launch overhead and similar effects. It is probably best to start at 64x64 or even higher.
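
A minimal sketch of the sweep from the first comment, restricted to sizes of 64 and up as suggested above. The tuner invocation and its flags are copied from the original script. `TUNER`, `MIN_SIZE`, `ALL_SIZES` and the `sizes_at_least` helper are illustrative names introduced here, not part of CLBlast, and the size list is shortened for brevity. If the tuner binary is not found, the script only prints the sizes it would sweep.

```shell
#!/bin/bash
# Sketch only: TUNER, MIN_SIZE, ALL_SIZES and sizes_at_least are
# hypothetical names for this example, not CLBlast features.
TUNER="${TUNER:-./bin/clblast_tuner_transpose_fast}"
MIN_SIZE="${MIN_SIZE:-64}"

# Print, space separated, every remaining argument that is >= the first one.
sizes_at_least() {
  local min="$1"; shift
  local out=()
  local s
  for s in "$@"; do
    if [ "$s" -ge "$min" ]; then
      out+=("$s")
    fi
  done
  echo "${out[*]}"
}

ALL_SIZES="1 2 3 4 5 8 9 10 12 15 16 20 25 30 32 48 50 64 92 100 128 256 512 1024 2048 4096 8192"
# ALL_SIZES is left unquoted on purpose so that it splits into words.
SIZES="$(sizes_at_least "$MIN_SIZE" $ALL_SIZES)"

if [ -x "$TUNER" ]; then
  for m in $SIZES; do
    for n in $SIZES; do
      # Same invocation as in the original script; the trailing field
      # stays empty when no "Best parameters:" line is printed.
      echo "m: $m n: $n" "$(LD_LIBRARY_PATH=./lib "$TUNER" --platform 1 -runs 1 -m "$m" -n "$n" | grep -E 'Best parameters:')"
    done
  done
else
  echo "tuner not found at $TUNER; sizes that would be swept: $SIZES"
fi
```

Setting `MIN_SIZE=256` (or any other threshold) in the environment before running moves the lower bound without editing the list.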