In the paper, you mentioned that the implementation can be ported to an fp16 version.
So, have you succeeded in implementing fp16 Winograd with Tensor Cores, and does it beat the performance of cuDNN?
I found that cuDNN doesn't have an fp16 Winograd 3x3 convolution, only an fp16 GEMM-based 3x3 convolution. I have no idea why NVIDIA hasn't implemented one.