DeMoriarty / custom_matmul_kernels

Customized matrix multiplication kernels
GNU General Public License v3.0

‘main’ branch seems to have errors #4

Open wxthu opened 2 years ago

wxthu commented 2 years ago

I switched to another machine and ran the main branch, but got some compile errors... [image]

When I check out the master branch, it just works. However, the performance of the customized kernel is below expectations compared to torch's built-in interface. I am confused.
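
For anyone reproducing this comparison, here is a minimal timing sketch. The `bench` helper and its parameters are hypothetical and not part of this repo. It assumes the usual caveat that GPU kernels launch asynchronously, so a blocking `sync` callable should be passed in, otherwise only the launch overhead gets timed.

```python
import time
from statistics import median


def bench(fn, sync=None, warmup=3, iters=10):
    """Return the median wall-clock time of fn() in milliseconds.

    `sync` is an optional zero-argument callable that blocks until queued
    work has finished (for CUDA code, e.g. torch.cuda.synchronize).
    """
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-off compilation and caching costs
    if sync is not None:
        sync()
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for asynchronous work before stopping the clock
        times.append((time.perf_counter() - start) * 1e3)
    return median(times)


if __name__ == "__main__":
    # CPU-only stand-in workload so the sketch runs on any machine.
    print(f"{bench(lambda: sum(range(100_000))):.3f} ms")
```

With PyTorch installed, the baseline could be timed with something like `bench(lambda: torch.bmm(a, b), sync=torch.cuda.synchronize)`, and the custom kernel timed the same way on the same inputs.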

DeMoriarty commented 2 years ago

The kernel in the master branch is an older version. I have fixed the bug in the main branch; can you try again?

wxthu commented 2 years ago

I think you misunderstood me. The master branch works, but its performance is bad. The main branch, however, could not run and gave some compile errors when I tried to run it. The compile error is shown above.

DeMoriarty commented 2 years ago

The performance of the master branch isn't good because it's an older version of the bmm kernel, which is not as optimized as the kernel in the main branch. I have fixed the bug that was causing the compile error in the main branch, so please try running the kernel in the main branch again.

wxthu commented 2 years ago

Thanks, I have tried again and it really worked. BTW, this kernel is not hardware-agnostic, so I need to tune some parameters or rewrite the CUDA kernel to get better performance on an NVIDIA RTX 3090, right?

DeMoriarty commented 2 years ago

Yes, as I explained in this blog post, this kernel is optimized for Turing series GPUs (such as Tesla T4, RTX 2080, Titan RTX...). For better performance on Ampere GPUs, it will be necessary to redesign certain parts of the kernel.
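
As a rough illustration of the Turing vs. Ampere distinction, the sketch below maps a CUDA compute capability to an architecture family. The helper names `arch_family` and `tuned_for_this_gpu` are hypothetical, and the table is deliberately partial: it only covers the GPUs mentioned in this thread plus two common data-center parts.

```python
def arch_family(major, minor):
    """Map a CUDA compute capability (major, minor) to an architecture family."""
    table = {
        (7, 0): "Volta",   # e.g. Tesla V100
        (7, 5): "Turing",  # e.g. Tesla T4, RTX 2080, Titan RTX
        (8, 0): "Ampere",  # e.g. A100
        (8, 6): "Ampere",  # e.g. RTX 3090
    }
    # Partial table: anything not listed is reported as "unknown".
    return table.get((major, minor), "unknown")


def tuned_for_this_gpu(major, minor, target="Turing"):
    """Return True if the GPU belongs to the family the kernel was tuned for."""
    return arch_family(major, minor) == target


if __name__ == "__main__":
    print(arch_family(8, 6))         # Ampere
    print(tuned_for_this_gpu(8, 6))  # False
```

With PyTorch, the `(major, minor)` pair can be read from `torch.cuda.get_device_capability()`.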

wxthu commented 2 years ago

Could you give some advice about which characteristics of the hardware platform we should consider when designing kernels for performance? Thank you very much.

DeMoriarty commented 2 years ago

I'd recommend looking into CUTLASS, which is open source and has reliable performance on various GPU architectures.

wxthu commented 2 years ago

Thanks so much!