CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

INT8 version of GEMM? #202

Open spedagadi opened 6 years ago

spedagadi commented 6 years ago

Hi

I am looking for an INT8 version of GEMM in OpenCL. If I am correct, CLBlast does not yet support it. Please correct me if I am wrong and comment on the usage (perhaps with a sample app, etc.).

Supposing an INT8 variant is not yet present in CLBlast, have you come across any other work that you could recommend? I did run into this repo https://github.com/strin/gemm-android and then ARM's compute library https://github.com/ARM-software/ComputeLibrary/blob/master/src/core/CL/cl_kernels/gemm.cl

My goal is to extend my project https://github.com/sat8/YoloOCLInference to support INT8 models during inference. I have gathered a few initial details on how to go about quantization from TensorFlow https://www.tensorflow.org/performance/quantization and would like to implement it in my project, but I am in need of an INT8 version of GEMM. TensorFlow refers to https://github.com/google/gemmlowp, which is a CPU- and NEON-optimized GEMM, i.e. a CPU-only library.

Any thoughts or comments would be appreciated.

CNugteren commented 6 years ago

I haven't done the research on INT8 yet, so I don't know of any other GEMM implementations with INT8.

Nevertheless, I think INT8 is an interesting topic for CLBlast. Having tackled FP16 already, I'd be willing to spend time on implementing such a feature, but I don't think it's easy: both on the host side and the device side, many things will have to change going from floating-point to fixed-point. Also, what kind of hardware would you run this on? Hardware with native INT8 support? Does ARM Mali support this (given that it's in ARM's compute library)? Or do they pack 4 values together in a 32-bit integer? I'll have to read up on this topic a bit more in order to give you a proper answer.

spedagadi commented 6 years ago

Thanks for the response.

"Or do they pack 4 values together in a 32-bit integer?" I think this may be true. You may want to check out https://github.com/google/gemmlowp/blob/master/doc/quantization.md and the reference code https://github.com/google/gemmlowp/blob/master/doc/quantization_example.cc

In the TensorFlow documentation, they highlight the range for mapping float to unsigned char, based on experimentation (see the image in their docs).

If my understanding is correct, INT8 is not a special datatype; rather, it's just an unsigned char value. Of course, with multiplication and other math ops, a bit depth of more than 8 may be required for the output. For example, a GEMM in INT8 may produce a 32-bit unsigned int.
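
To make that concrete, here is a rough C sketch of the gemmlowp-style affine mapping as I understand it; the struct and function names below are my own, not taken from any library:

```c
#include <stdint.h>
#include <math.h>

/* Hypothetical sketch of the affine quantization scheme:
   real_value = scale * (quantized_value - zero_point),
   with quantized values stored as unsigned char. */
typedef struct {
    float scale;        /* size of one quantization step */
    uint8_t zero_point; /* quantized value that represents real 0.0 */
} QuantParams;

static uint8_t quantize(float x, QuantParams q) {
    float v = roundf(x / q.scale) + (float)q.zero_point;
    if (v < 0.0f) v = 0.0f;      /* clamp to the uint8 range */
    if (v > 255.0f) v = 255.0f;
    return (uint8_t)v;
}

static float dequantize(uint8_t x, QuantParams q) {
    return q.scale * ((int)x - (int)q.zero_point);
}

/* Why the GEMM output needs more than 8 bits: a single product of two
   8-bit values already needs up to 16 bits, and summing K of them along
   the inner GEMM dimension needs roughly 16 + log2(K) bits, hence the
   usual 32-bit accumulator. */
```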

"Also, what kind of hardware would you run this on?" I am thinking of using low-precision GEMM on an Asus Tinker Board that has a Mali-T764 GPU, as well as on an AMD RX 580 and a GTX 1080 Ti. At this point, I am not sure of the speedup factor that INT8-based inference could produce over pure floating-point math, but I would like to validate it to understand it better.

"Hardware with native INT8 support?" NVIDIA cards do seem to have instructions such as dp4a which could generate some speedup, but I am unsure where such instructions are exposed in OpenCL on any hardware. For now, I am aiming to compare FP32 vs. INT8 deep learning inference, supposing GEMM is in INT8 and I optimize my inference kernels to use byte data. I would think doing so would certainly generate a speedup, as is widely claimed by almost all hardware vendors. Any hardware-native INT8 optimizations could come later in my project.
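
For reference, here is a small plain-C sketch (function name is my own) of what a dp4a-style operation computes, i.e. a 4-way int8 dot product accumulated into 32 bits; on hardware without the native instruction, a kernel would have to do this unpacking explicitly:

```c
#include <stdint.h>

/* Emulate what a dp4a-style instruction does in one step: treat each
   32-bit word as four packed signed 8-bit lanes, multiply lane-wise,
   and add the four products to a 32-bit accumulator. */
static int32_t dp4a_emulated(uint32_t a_packed, uint32_t b_packed, int32_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        int8_t a = (int8_t)((a_packed >> (8 * lane)) & 0xFF);
        int8_t b = (int8_t)((b_packed >> (8 * lane)) & 0xFF);
        acc += (int32_t)a * (int32_t)b;
    }
    return acc;
}
```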

naibaf7 commented 6 years ago

@sat8 https://github.com/naibaf7/caffe has experimental INT8 kernels for both CUDA and OpenCL, if you're still interested in playing with this.

Edit: I have to mention you won't have the greatest time performance-wise. It turns out int8 FMAD is probably going to end up being int32 FMAD on AMD cards, and the additional computations for quantization have a cost as well, especially in shared memory and registers. I haven't seen a DP4A equivalent on either AMD or Mali.

J0hnn4J1ang commented 6 years ago

Do you have any update on this issue now, or a roadmap? Thanks.

CNugteren commented 6 years ago

No, not really. I'm not sure if I will ever work on this; other things have priority. But contributors are free to work on this, of course.

What hardware would you run it on? What use-case do you have?

J0hnn4J1ang commented 6 years ago

Hi, I'm working on a kind of mining algorithm that needs batches of 256-by-256 int8-to-int16 matrix multiplications. For NVIDIA CUDA this is already done, but for AMD OpenCL there doesn't seem to be a solution yet.
As you do not have a plan for this, I think I will work it out myself.

CNugteren commented 6 years ago

Well, you could try naibaf7's implementation as mentioned above. But as he says, there is not much support for INT8 multiplications in hardware, so you probably won't gain much (or will actually lose performance) compared to FP32.

J0hnn4J1ang commented 6 years ago

@CNugteren Thanks for the info, much appreciated.

engineer1109 commented 10 months ago

INT8 GEMM is usually done as s8s8s32, i.e. something like int c = (int8_t)a * (int8_t)b; the result uses a 32-bit int, while the inputs use int8_t.
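
A naive s8s8s32 GEMM along those lines might look like the following sketch (plain C, row-major, no quantization scaling or zero-point handling; the function name is my own):

```c
#include <stdint.h>
#include <stddef.h>

/* Naive s8s8s32 GEMM: C[MxN] = A[MxK] * B[KxN], row-major.
   Inputs are int8_t; each product is widened to int32_t before
   accumulation so intermediate sums do not overflow 8 bits. */
void gemm_s8s8s32(size_t M, size_t N, size_t K,
                  const int8_t *A, const int8_t *B, int32_t *C) {
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            int32_t acc = 0;
            for (size_t k = 0; k < K; ++k) {
                acc += (int32_t)A[i * K + k] * (int32_t)B[k * N + j];
            }
            C[i * N + j] = acc;
        }
    }
}
```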