junliume opened 10 months ago
hotfix is here: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/2497
The root cause: in this case the WrW kernel uses atomics, so the solver calls the CastTensor() function to cast from fp32 to fp16.
However, in the source code of the cast kernel, src/kernels/MIOpenSubTensorOpWithCastTensorKernel.cl, the value is clamped to MAX:
```c
_FLOAT_SRC temp_src = *(src + sindex + srcOffset);
#if MIOPEN_SRC_TYPE == 3 && MIOPEN_DST_TYPE == 4
    temp_src *= alpha;
    *(dst + dindex + dstOffset) = float_to_bfloat16(temp_src);
#else
    bool over_flow = (alpha * ((float)temp_src)) >= ((float)MAX_VAL);
    *(dst + dindex + dstOffset) =
        (_FLOAT_DST)(over_flow ? MAX_VAL : alpha * ((float)temp_src));
#endif
}
```
@junliume I think someone needs to evaluate why the clamp to MAX is needed here.
@junliume @JehandadKhan @carlushuang First of all, it should be noted that the convolution code we generate is focused on performance, not IEEE-754 correctness. Convolutions should provide maximum performance, provided that the accuracy is sufficient for the neural networks to function correctly.
That is why the OpenCL kernels (for example) are compiled with the `-cl-fast-relaxed-math` option, which neither guarantees high precision for many operations nor provides correct handling of special numbers. In particular, this option enables `-cl-finite-math-only`, whose meaning is obvious. See https://man.opencl.org/clBuildProgram.html for more info. HIP and ASM kernels are expected to follow the same policy wrt precision vs performance.
Please also note that the result of MAX instead of INF can be interpreted as a difference of 1ULP, which is sufficient accuracy for convolutions ;)
So the test that checks for INF on output is questionable.
On the other hand, if we want to make the test pass, fixing the cast kernel (as triaged by @carlushuang) is the way to go. But please note that while this resolves the immediate problem, we cannot always guarantee the expected special-number result (INF), due to the above.
WRT the implementation of the full-blown fix: git blame shows that the `over_flow` flag has been used in the kernels from the beginning. That is all I found in the public repo. Therefore I highly recommend:
@carlushuang Thanks for triaging the issue!
I want to make two points. First, frameworks rely on backward convolution kernels for fp16 (and likewise fp8) producing INFs when computation results overflow. It is not an IEEE-754 compliance issue; it is a fundamental requirement.
Second, whatever the reasoning was for that clamp into finite range, we can probably agree that applying it only on the positive side can't possibly be correct.
From the frameworks perspective, it would be best if the cast kernel clamped into finite range for forward but not backward convolutions. If that is too hard, then it should not clamp at all.
@JehandadKhan and @atamazov (CC: @carlushuang) could we change the logic and only "clamp" in the forward direction? Then we could let it be tested through a whole round of staging and try to identify any issues long before the next release.
@ekuznetsov139 Thanks for the explanation of the frameworks' requirement.
@JehandadKhan @carlushuang I can work on this if no one else is doing it yet. The plan is to implement the 3rd and 4th bullets from https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2496#issuecomment-1788879539. It seems like we have little time. Q: shall we add a run-time or build-time parameter?
@junliume
we could change the logic and only "clamp" in the forward direction?
Sure, and this seems the best solution from the frameworks' POV.
@junliume @JehandadKhan The fix is implemented in #2538.
The fix is very conservative (minimal functional changes, because the release branching date is near). We should consider another option: remove the clamping controls (introduced in this PR) and keep clamping permanent, but replace clamping to MAX with clamping to INF. This seems mathematically correct (and can't break https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2496 again), and should slightly improve performance, but let's do that after release branching.
Therefore I recommend keeping this ticket open (after merging #2538) but lowering its urgency, and then closing after implementing the clamping to INF.
@junliume @ekuznetsov139 The fix is merged into develop. If the problem is resolved, then I suggest lowering the urgency of this issue to "normal" and removing the milestone (see the previous comment).
[Observations] On the gfx90a platform, with the input input.txt:
The IGEMM kernel gave output:
But the expected output (from direct convolution) should be:
[Analysis]
INF
@carlushuang could you help to comment on this issue if anything is missing? Thanks!