ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

[Fp16] MIOpen integration or layout transpose issue with FP16_INF and FP16_MAX #2496

Open junliume opened 10 months ago

junliume commented 10 months ago

[Observations] On the gfx90a platform, with the input file input.txt:

# /opt/rocm/bin/MIOpenDriver convfp16 -n 5 -c 1 -H 3 -W 3 -k 3 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0 --dump_output 1 --in_data input.txt.bin --dout_data dout.txt.bin
MIOpenDriver convfp16 -n 5 -c 1 -H 3 -W 3 -k 3 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 4 -t 1 -V 0 --dump_output 1 --in_data input.txt.bin --dout_data dout.txt.bin
Read data from input file input.txt.bin
Could not open file dout.txt.bin for reading
PRNG seed: 12345678
Wrote output to file dump_in.bin
Wrote output to file dump_dout.bin
MIOpen Backward Weights Conv. Algorithm: 5, Solution: 110/ConvAsmImplicitGemmGTCDynamicWrwXdlopsNHWC
GPU Kernel Time Backward Weights Conv. Elapsed: 0.024996 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: bwdw-conv1x1u1, 5, 1, 3, 3, 1, 1, 3,  270, 0, 0, 0, 0, 0.024996
Wrote output to file dump_bwd_dwei_gpu.bin

The IGEMM kernel gave output:

[65504. 65504. 65504.]

But the expected output (from direct convolution) should be:

[inf, inf, inf]
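In other words, the IGEMM path saturates the fp32->fp16 cast at the fp16 MAX (65504), while a plain round-to-nearest conversion of an overflowing value produces inf. A minimal host-side C++ sketch (using the _Float16 compiler extension; not MIOpen code) illustrates the difference:

    // Minimal sketch (not MIOpen code): fp32 -> fp16 conversion on overflow.
    // Assumes a compiler with the _Float16 extension (e.g. clang).
    #include <cstdio>

    int main() {
        float acc = 1.0e6f; // an accumulated value well above the fp16 max of 65504
        _Float16 rounded = (_Float16)acc;                      // plain conversion -> inf
        _Float16 clamped = acc >= 65504.0f ? (_Float16)65504.0f
                                           : (_Float16)acc;    // saturating cast -> 65504
        printf("rounded=%f clamped=%f\n", (double)rounded, (double)clamped);
        return 0;
    }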

[Analysis]

@carlushuang could you help comment on this issue in case anything is missing? Thanks!

carlushuang commented 10 months ago

hotfix is here: https://github.com/ROCmSoftwarePlatform/MIOpen/pull/2497

carlushuang commented 10 months ago

The root cause: in this case the wrw kernel uses atomics, so the solver uses the CastTensor() function to cast from fp32 to fp16. However, in the source code of the cast function, src/kernels/MIOpenSubTensorOpWithCastTensorKernel.cl, the value is clamped to the MAX value:

            _FLOAT_SRC temp_src = *(src + sindex + srcOffset);
#if MIOPEN_SRC_TYPE == 3 && MIOPEN_DST_TYPE == 4
            /* fp32 -> bf16 path */
            temp_src *= alpha;
            *(dst + dindex + dstOffset) = float_to_bfloat16(temp_src);
#else
            /* other destination types: clamp to MAX_VAL on positive overflow only */
            bool over_flow = (alpha * ((float)temp_src)) >= ((float)MAX_VAL);
            *(dst + dindex + dstOffset) =
                (_FLOAT_DST)(over_flow ? MAX_VAL : alpha * ((float)temp_src));
#endif
        }

@junliume I think we need someone to evaluate why the clamp to MAX is needed here.

atamazov commented 10 months ago

@junliume @JehandadKhan @carlushuang First of all, it should be noted that the convolution code we generate is focused on performance, not IEEE-754 correctness. Convolutions should provide maximum performance, provided that the accuracy is sufficient for the neural networks to function correctly.

That is why the OpenCL kernels (for example) are compiled with the -cl-fast-relaxed-math option, which neither guarantees high precision for many operations nor provides correct handling of special numbers. In particular, this option enables -cl-finite-math-only, whose meaning is obvious. See https://man.opencl.org/clBuildProgram.html for more info. HIP and ASM kernels are expected to follow the same policy wrt precision vs performance.
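For reference, such options are passed in the build-options string of clBuildProgram; a generic OpenCL host-side example (not MIOpen's actual build path) looks like this:

    // Generic OpenCL host-side example (not MIOpen's actual build path).
    // -cl-fast-relaxed-math implies -cl-finite-math-only, so the compiler may
    // assume the kernel neither produces nor consumes INFs/NaNs.
    #include <CL/cl.h>

    cl_int build_fast_relaxed(cl_program program, cl_device_id device) {
        const char* options = "-cl-fast-relaxed-math";
        return clBuildProgram(program, 1, &device, options, nullptr, nullptr);
    }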

Please also note that the result of MAX instead of INF can be interpreted as a difference of 1ULP, which is sufficient accuracy for convolutions ;)

So the test that checks for INF on output is questionable.

On the other hand, if we want to make the test pass, fixing the cast kernel (as triaged by @carlushuang) is the way to go. But please note that while this resolves the immediate problem, we cannot always guarantee the expected special-number result (INF), due to the above.

WRT the implementation of the full-blown fix: git blame shows that the over_flow flag has been used in the kernels from the beginning, and that is all I could find in the public repo. Therefore I highly recommend:

@carlushuang Thanks for triaging the issue!

ekuznetsov139 commented 10 months ago

I want to make two points. First, frameworks rely on backward convolution kernels for fp16 (and likewise fp8) producing infs when computation results overflow. It is not an IEEE-754 compliance issue; it is a fundamental requirement.

Second, whatever the reasoning was for that clamp into finite range, we can probably agree that applying it only on the positive side can't possibly be correct.

From the frameworks perspective, it would be best if the cast kernel clamped into finite range for forward but not backward convolutions. If that is too hard, then it should not clamp at all.
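A rough sketch of that idea as a standalone C++ helper, not the actual kernel change (the function name cast_fp32_to_fp16 and the clamp_to_finite flag are hypothetical):

    // Sketch only: clamp the fp32 -> fp16 cast to the finite range when requested
    // (e.g. for forward convolutions), and otherwise let overflow produce +/-inf
    // as the frameworks expect. Assumes the _Float16 compiler extension.
    _Float16 cast_fp32_to_fp16(float alpha, float x, bool clamp_to_finite) {
        const float v = alpha * x;
        if (clamp_to_finite) {
            if (v >= 65504.0f)  return (_Float16)65504.0f;    // fp16 MAX
            if (v <= -65504.0f) return (_Float16)(-65504.0f); // clamp the negative side too
        }
        return (_Float16)v; // plain conversion: overflow rounds to +/-inf
    }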

junliume commented 10 months ago

@JehandadKhan and @atamazov (CC: @carlushuang): we could change the logic and only "clamp" in the forward direction? Then we could let it be tested through a whole round of staging and try to identify any issues long before the next release.

atamazov commented 10 months ago

@ekuznetsov139 Thanks for the explanation of the frameworks' requirement.

@JehandadKhan @carlushuang I can work on this if no one else is doing it yet. The plan is to implement the 3rd and 4th bullets from https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2496#issuecomment-1788879539. It seems like we have little time. Q: shall we add a run-time or build-time parameter?

@junliume

we could change the logic and only "clamp" in the forward direction?

Sure, and this seems the best solution from the frameworks' POV.

atamazov commented 10 months ago

@junliume @JehandadKhan The fix is implemented in #2538.

The fix is very conservative (minimal functional changes, because the release branching date is near). We should consider another option: remove the clamping controls (introduced in that PR) and keep clamping permanent, but replace clamping to MAX with clamping to INF. This seems mathematically correct (and can't break https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2496 again), and should slightly improve performance, but let's do that after release branching.
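To illustrate what "clamping to INF" would mean for the fp32->fp16 cast (sketch only, not the kernel code; the helper name cast_clamp_to_inf is hypothetical):

    // Sketch of the "clamp to INF" variant: on overflow, write a signed infinity
    // instead of the fp16 MAX value. Assumes the _Float16 compiler extension.
    #include <cmath>

    _Float16 cast_clamp_to_inf(float alpha, float x) {
        const float v = alpha * x;
        if (v > 65504.0f)  return (_Float16)INFINITY;    // was: clamp to 65504
        if (v < -65504.0f) return (_Float16)(-INFINITY); // symmetric handling of negatives
        return (_Float16)v;
    }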

Therefore I recommend keeping this ticket open (after merging #2538) but lowering its urgency, and then closing it once clamping to INF is implemented.

atamazov commented 10 months ago

@junliume @ekuznetsov139 The fix is merged into develop. If the problem is resolved, then I suggest lowering the urgency of this issue to "normal" and removing the milestone (see previous comment).