daadaada / turingas

Assembler for NVIDIA Volta and Turing GPUs
MIT License

fp16 winograd #7

Closed clarencewxl closed 3 years ago

clarencewxl commented 3 years ago

In the paper, you mentioned that the implementation can be ported to an fp16 version. Have you succeeded in implementing fp16 Winograd with Tensor Cores and beating cuDNN's performance?

I found that cuDNN doesn't have an fp16 Winograd 3x3 convolution, only an fp16 GEMM-based 3x3 convolution. I have no idea why NVIDIA hasn't implemented one.

daadaada commented 3 years ago

Hi.

I have not implemented a fused fp16 Winograd kernel on Tensor Cores yet.

I believe cuDNN's non-fused Winograd uses Tensor Cores.
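For readers unfamiliar with the fused/non-fused distinction, here is a minimal pure-Python sketch of 1D Winograd F(2,3) run as three separate stages. It is not taken from turingas or cuDNN; the function names (`winograd_f23`, `direct`, `matvec`) and the data layout are assumptions made for illustration. The point it tries to show is that the middle stage, which sums over input channels, is an ordinary matrix product per transform position. This is why a non-fused implementation can plausibly hand that stage to a GEMM routine that runs on Tensor Cores.

```python
# Standard Winograd F(2,3) transform matrices: 2 outputs, 3-tap filter.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def winograd_f23(inputs, filters):
    """inputs[c] is a 4-element tile for channel c;
    filters[k][c] is a 3-tap filter for output k, channel c.
    Returns out[k], a list of 2 outputs per filter k."""
    C, K = len(inputs), len(filters)
    # Stage 1: input and filter transforms (separate kernels when non-fused).
    V = [matvec(BT, inputs[c]) for c in range(C)]        # V[c][xi]
    U = [[matvec(G, filters[k][c]) for c in range(C)]    # U[k][c][xi]
         for k in range(K)]
    # Stage 2: for each position xi, a (K x C) by (C x tiles) product.
    # Only one tile is used here; this is the GEMM-shaped stage.
    M = [[sum(U[k][c][xi] * V[c][xi] for c in range(C)) for xi in range(4)]
         for k in range(K)]
    # Stage 3: output transform.
    return [matvec(AT, M[k]) for k in range(K)]


def direct(inputs, filters):
    """Reference: direct 1D correlation summed over channels."""
    C = len(inputs)
    return [[sum(f[c][j] * inputs[c][i + j] for c in range(C) for j in range(3))
             for i in range(2)]
            for f in filters]
```

A fused kernel would perform all three stages in one launch and keep the intermediate `U`, `V`, and `M` values on-chip. The non-fused form writes them to memory between stages, which costs bandwidth but lets stage 2 reuse an existing GEMM.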