CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

Tuner stuck in 'dead lock' and never completes #546

Open diverger opened 4 months ago

diverger commented 4 months ago

Hi, when running 'make alltuners' on a Mali GPU, some tuners run for hours. Eventually the process gets stuck and never returns. Is there any way to speed this up?

CNugteren commented 4 months ago

There are a few ways.

First of all, you could modify the tuner's file, e.g. CLBlast/src/tuning/kernels/xgemm.hpp, and reduce the number of values per parameter in settings.parameters in multiple places, e.g. change {16, 32, 64} into {16, 32}.
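To get a feel for how much this helps: the number of configurations to test is at most the product of the number of values per parameter, so removing one value from a few parameters shrinks the search space quickly. Below is a rough sketch of that arithmetic. The parameter names and values are only illustrative and not copied from xgemm.hpp, SearchSpaceSize and PrintReduction are hypothetical helpers, and the real tuner also applies constraints, so its actual count can be lower.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// A parameter name plus its candidate values (illustrative stand-in for the
// entries of settings.parameters).
using Parameter = std::pair<std::string, std::vector<std::size_t>>;

// Upper bound on the number of configurations: the product of the value counts.
// Constraints in a real tuner can only make the actual number smaller.
std::size_t SearchSpaceSize(const std::vector<Parameter>& parameters) {
  std::size_t size = 1;
  for (const auto& parameter : parameters) {
    size *= parameter.second.size();
  }
  return size;
}

// Compares an illustrative parameter set before and after trimming values.
void PrintReduction() {
  const std::vector<Parameter> original = {
      {"MWG", {16, 32, 64}}, {"NWG", {16, 32, 64}}, {"KWG", {16, 32}}};
  const std::vector<Parameter> reduced = {
      {"MWG", {16, 32}}, {"NWG", {16, 32}}, {"KWG", {16, 32}}};
  std::printf("original: %zu configurations\n", SearchSpaceSize(original));
  std::printf("reduced:  %zu configurations\n", SearchSpaceSize(reduced));
}
```

In this made-up example the upper bound drops from 18 to 8 configurations. The real parameter lists are longer, so the effect multiplies across more parameters.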

Secondly, you could change the --fraction command-line argument (of e.g. clblast_tuner_xgemm) to something below 1.0 to not test everything.

Thirdly, you could tune only for the precision you need, e.g. single-precision (32) float only, and skip the other tuners. Basically, make alltuners first compiles everything and then runs all the tuners (e.g. ./clblast_tuner_xgemm --precision 32) for all precisions, one after another.

Lastly, for GEMM specifically there are 4 parts being tuned (from CLBlast/src/tuning/kernels/xgemm.cpp):

    printf("* (1/4) Tuning main GEMM kernel (GEMMK == 0) for fixed set of parameters\n\n");
    StartVariation<1>(argc, argv);
    printf("* (2/4) Tuning main GEMM kernel (GEMMK == 0) for random parameters out of larger set\n\n");
    StartVariation<2>(argc, argv);
    printf("* (3/4) Tuning secondary GEMM kernel (GEMMK == 1) for fixed set of parameters\n\n");
    StartVariation<11>(argc, argv);
    printf("* (4/4) Tuning secondary GEMM kernel (GEMMK == 1) for random parameters out of larger set\n\n");
    StartVariation<12>(argc, argv);

You could skip steps 2/4 and 4/4 to save time.
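As a minimal sketch of what skipping those two steps could look like, the sequence above can be reduced to the two fixed-set variations. This is not the actual file: StartVariation is replaced by a stand-in that only records which variation was requested, so the snippet compiles on its own, and TuneFixedSetsOnly is a hypothetical name for what would be the body of main in xgemm.cpp.

```cpp
#include <cstdio>
#include <vector>

// Stand-in for the real tuner entry point (an assumption made so that this
// sketch is self-contained). It only records the requested variation.
static std::vector<int> executed_variations;

template <int V>
void StartVariation(int argc, char* argv[]) {
  (void)argc;  // unused in this stand-in
  (void)argv;  // unused in this stand-in
  executed_variations.push_back(V);
}

// Same order as the original four steps, with the two random-search steps
// (variations 2 and 12) left out.
void TuneFixedSetsOnly(int argc, char* argv[]) {
  std::printf("* (1/2) Tuning main GEMM kernel (GEMMK == 0) for fixed set of parameters\n\n");
  StartVariation<1>(argc, argv);
  // Skipped: StartVariation<2>(argc, argv);   random parameters, GEMMK == 0
  std::printf("* (2/2) Tuning secondary GEMM kernel (GEMMK == 1) for fixed set of parameters\n\n");
  StartVariation<11>(argc, argv);
  // Skipped: StartVariation<12>(argc, argv);  random parameters, GEMMK == 1
}
```

In the real file this would amount to commenting out the two StartVariation calls for the random-parameter steps and rebuilding the tuner. The trade-off is presumably that fewer configurations get explored, so the tuned parameters may be less optimal.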

diverger commented 4 months ago

Can I achieve this by modifying CMakeLists.txt?

CNugteren commented 3 months ago

No, I don't think so.