Closed atamazov closed 3 months ago
@CAHEK7 Maybe later (I do not have perf tests for this primitive on hand). This PR is about correctness, and I am quite sure that it doesn't lead to perf degradations (I hope you are sure too). Also, I suspect that most of the GPU time is spent on address/offset computations, so the expected perf gain is pretty small.
The primitive produces invalid results when BETA=0 and the output buffer contains junk (NaNs). This PR fixes the issue.
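To illustrate the failure mode, here is a minimal host-side sketch. It assumes the primitive follows the conventional blend `out = alpha * in + beta * out`; the function names `blend_naive` and `blend_safe` are hypothetical and are not taken from this PR or from the library. In IEEE 754 arithmetic `0 * NaN` is NaN, so multiplying stale output by a zero beta does not clear it; the output has to be ignored explicitly when beta is zero.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical naive blend (assumed formula): out = alpha * in + beta * out.
// With beta == 0 and NaN junk in `out`, the result is NaN because 0 * NaN == NaN.
void blend_naive(const std::vector<float>& in, std::vector<float>& out, float alpha, float beta)
{
    for(std::size_t i = 0; i < out.size(); ++i)
        out[i] = alpha * in[i] + beta * out[i];
}

// Hypothetical corrected blend: the prior contents of `out` are not used
// at all when beta == 0, so junk in the buffer cannot leak into the result.
void blend_safe(const std::vector<float>& in, std::vector<float>& out, float alpha, float beta)
{
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        const float prior = (beta == 0.0f) ? 0.0f : beta * out[i];
        out[i]            = alpha * in[i] + prior;
    }
}
```

Calling `blend_naive` with `beta = 0` on a buffer pre-filled with `std::numeric_limits<float>::quiet_NaN()` leaves NaNs in the output, while `blend_safe` overwrites them with `alpha * in[i]`. The extra comparison is a single branch on a value that is the same for every element, which is consistent with the expectation that the perf impact is small.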
By-products:
Related issue: #2828
[Attribution] @junliume @JehandadKhan