Open pavanky opened 8 years ago
@TimmyLiu We can fix the issue and send in a PR, but we are not sure our fix would be comprehensive. Can you provide us a list of the kernels where only "B1" is implemented?
An easy solution would be to call `clEnqueueFillBuffer` in these calls when beta == 0.
I just realized `clEnqueueFillBuffer` may not work because of strides.
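To illustrate the stride concern, here is a minimal host-side sketch in plain C (no OpenCL calls), assuming a column-major "C" of size m x n stored with a leading dimension ldc > m. The helper names `zero_flat` and `zero_strided` are hypothetical and not part of clBLAS. A single contiguous fill also overwrites the ldc - m padding elements after each column, which may belong to a larger parent matrix; zeroing column by column touches only the m x n block.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: one contiguous fill over the whole ldc x n span.
 * This also clears the (ldc - m) elements that follow each column. */
void zero_flat(float *c, size_t n, size_t ldc)
{
    memset(c, 0, n * ldc * sizeof *c);
}

/* Hypothetical helper: zero only the m x n block, one column at a time.
 * Column j starts at element offset j * ldc in column-major storage. */
void zero_strided(float *c, size_t m, size_t n, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)
        memset(c + j * ldc, 0, m * sizeof *c);
}
```

On the device side, the per-column variant would presumably correspond to one `clEnqueueFillBuffer` call per column, each with its own offset and size, which is likely slower than a single fill.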
Hi @pavanky. I see. All the special kernels are here: https://github.com/clMathLibraries/clBLAS/tree/master/src/library/blas/AutoGemm/UserGemmKernelSources, although the fastest way is to bypass the special kernels when beta is zero.
For example the following exists (where beta != 0):
But this does not (where beta == 0):
This is a problem when "C" is not initialized properly. For example, when "C" is just allocated but not explicitly set to 0, the initial values can sometimes be NaN. Multiplying a NaN by 0 still results in NaN, so the NaNs propagate to the outputs.
One could argue this is according to the BLAS spec, but we haven't noticed this behavior in other BLAS implementations.
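The NaN propagation described above can be reproduced on the host with a small reference GEMM in plain C. This is only a sketch under assumptions: `gemm_ref` and its `skip_c_when_beta_zero` flag are hypothetical names for illustration, not clBLAS code, and the layout is assumed to be column-major. With the flag off, C is always read, so 0 * NaN stays NaN; with the flag on, C is never read when beta == 0.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical reference GEMM: C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, all column-major.
 * If skip_c_when_beta_zero is nonzero, C is not read when beta == 0. */
void gemm_ref(size_t m, size_t n, size_t k, float alpha,
              const float *a, size_t lda,
              const float *b, size_t ldb,
              float beta, float *c, size_t ldc,
              int skip_c_when_beta_zero)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];

            float *cij = &c[i + j * ldc];
            if (skip_c_when_beta_zero && beta == 0.0f)
                *cij = alpha * acc;                 /* C is write-only here */
            else
                *cij = alpha * acc + beta * (*cij); /* 0 * NaN is still NaN */
        }
    }
}
```

Calling it on a 1 x 1 problem with C preset to `NAN` and beta == 0 gives a NaN result with the flag off and the plain product alpha * A * B with the flag on.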