Open JehandadKhan opened 1 year ago
@carlushuang could you help to take a look?
This is a non-xdlops kernel targeting NCHW layout, but I can't reproduce this failure on an MI200 machine. Is it tested on MI100 or Vega?
@JehandadKhan could you clarify on the reproducing steps and env?
@carlushuang I tested on MI100 system.
@JehandadKhan this solver is targeting non-xdlops kernels, so performance will be not good. For MI100/MI200 there are alternative solvers like ConvHIPImplicitGemmForwardV4R4XDLops
, and other asm solvers that support this case. So if this one has computation bug I think we have to disable it. What do you think?
cc @zjing14
@carlushuang @JehandadKhan : our CI is failing consistently on the following http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/develop/964/pipeline
@carlushuang @junliume I do not remember BF16 precision problems with this solver. We need to find the root reason of the issue before trying to fix of workaround the issue. It could be, for example, a bug in the compiler. What is ROCm version?
So if this one has computation bug I think we have to disable it.
Please look at #936. Maybe this is a kind of verification bug.
@JehandadKhan Is it so that the smoke_solver_ConvHipImplicitGemmV4R1
test is failing with BF16? This is interesting... What has changed? Maybe tuning?
I recommend renaming this ticket to "smoke_solver_ConvHipImplicitGemmV4R1 test is failing with BF16".
@carlushuang @JehandadKhan @junliume
...For MI100/MI200 there are alternative solvers like
ConvHIPImplicitGemmForwardV4R4XDLops
, and other asm solvers that support this case. So if this one has computation bug I think we have to disable it. What do you think?
As we can see from #936, we have verification problems with this solver for a long time. The solver is originated from https://github.com/AMDComputeLibraries/MLOpen/pull/2132, and it seems that nobody has time to maintain it.
Therefore I agree with @carlushuang and would vote for disabling/removing ConvHipImplicitGemmV4R1Fwd, but we need to make sure that performance remains at the same level.
/CC @asroy
@carlushuang
...I can't reproduce this failure on an MI200 machine...
The solver is applicable for MI200 (please check ConvHipImplicitGemmV4R1Fwd::IsApplicable() to see). Maybe you have some environment setting that prevents this solver from running.
Now I see the logs and know the symptom and can explain the root reason.
The smoke_solver_ConvHipImplicitGemmV4R1
fails. The test checks tuning, among other things. It performs only 5 rounds (to save time) of tuning and then uses the resulting tuning config (PerformanceConfig) to run the solver. Then the output fails validation.
Specifically, the kernel produced during the smoke_solver_ConvHipImplicitGemmV4R1
test accidentally turned out to be the one that does not pass the correctness check.
I think it matches the root reason of #936. It could be one of these two:
ConvHipImplicitGemmV4R1Fwd
solver is invalid. Specifically, it does not comply the requirement for IsValidPerformanceConfig()
listed at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/866#issuecomment-822585406 which says that ""Valid" means that ... Result of execution would be numerically correct." In other words, all the PerformanceConfigs used in tuning process should produce kernels that can pass validation.@atamazov it seems that a workaround in https://github.com/ROCmSoftwarePlatform/MIOpen/commit/277e35c10f036833f781477f3db7bcb7ad52906d can actually let the unit test pass. I am still puzzled why this issue happens now, i.e. after tuning limitation is implemented.
[Reproduce the Issue]:
Pull rocm/miopen:ci
docker, and build MIOpen develop at e0179c0, then
make smoke_solver_ConvHipImplicitGemmV4R1 -j$(nproc)
and ctest -R smoke_solver_ConvHipImplicitGemmV4R1
MIOpen(HIP): Info [PrintVersion] HIPRTC v.9.0
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' static_kernel_gridwise_convolution_forward_implicit_gemm_v4r4_xdlops_nchw_kcyx_nkhw.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: static_kernel_gridwise_convolution_forward_implicit_gemm_v4r4_xdlops_nchw_kcyx_nkhw.cpp
MIOpen(HIP): Warning [BuildHip] In file included from <built-in>:1:
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:939:49: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
__hip_assert(false && "invalid tile size");
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:986:49: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
__hip_assert(false && "invalid tile size");
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1192:62: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
__hip_assert(false && "invalid cooperative group type");
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1216:62: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
__hip_assert(false && "invalid cooperative group type");
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1245:62: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
__hip_assert(false && "invalid cooperative group type");
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1380:62: error: unused parameter 'g' [-Werror,-Wunused-parameter]
const thread_block_tile_internal<tbtSize, tbtParentT>& g)
^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1383:67: error: unused parameter 'g' [-Werror,-Wunused-parameter]
__CG_QUALIFIER__ thread_block_tile_internal(const thread_block& g)
^
7 errors generated when compiling for gfx90a.
These warning messages apparently come from hipRTC/COMgr, but why it was not a problem before #1997 ?
@junliume ~Because #1997 reorders tuning configs in the container in some random order. Now the first 5 configs are different.~
@junliume Oh, no this is totally different issue. Let's discuss it separately.
@junliume Because #1997 reorders tuning configs in the container in some random order. Now the first 5 configs are different.
@atamazov Okay. Thanks! So when MIOpen failed to compile ConvHIPImplicitGemmForwardV4R4XDLops
it moves on to ConvHipImplicitGemmV4R1Fwd
and then have some numerical issues?
BTW~ could you review if https://github.com/ROCmSoftwarePlatform/MIOpen/commit/277e35c10f036833f781477f3db7bcb7ad52906d is the right place to suppress these hipRTC warnings? Or there might be other places?
@junliume This is different issue. Let's hide the comments about warning to avoid messing things.
@junliume What I see is:
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GetAllConfigs] ConvHipImplicitGemmV4R1Fwd: Searching the best solution among 33...
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 0/0/5 0.23334, best within recent 1: 0.23334 #0 16,32,8,2,2,2,4,4,2,4,8,1,16,1,4,32, ETA:0 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 1/0/5 0.23334, best within recent 1: 0.750709 #1 16,32,4,2,4,4,2,4,2,4,4,1,16,1,4,16, ETA:53.7877 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 2/0/5 0.23334, best within recent 1: 0.285917 #2 16,128,4,2,4,4,4,4,4,4,4,2,16,2,2,128, ETA:28.3322 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 3/0/5 0.114943, best within recent 1: 0.114943 #3 16,128,16,2,4,4,4,4,4,4,16,1,16,1,2,128, ETA:15.7516 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GenericSearch] Done: 5/0/5, best #3 0.114943 16,128,16,2,4,4,4,4,4,4,16,1,16,1,2,128
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GenericSearch] ...Score: 2.44989 (default time 0.281597)
[2023-03-23T18:53:48.490Z] FAILED: 0.535381
[2023-03-23T18:53:48.490Z] Max diff: 255
[2023-03-23T18:53:48.490Z] Mismatch at 0: 30 != 129
No build warnings, just validation error.
@atamazov indeed it could be two separate issues: from https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2038#issuecomment-1481975124
ConvHIPImplicitGemmForwardV4R4XDLops
is not picked when it should. This is about the warnings in compilation and should be mentioned in a separate issue (thus hide here);ConvHipImplicitGemmV4R1Fwd
is picked, why the test fails due to mismatch. I agree this is the main topic of this thread.why ConvHIPImplicitGemmForwardV4R4XDLops is not picked when it should. This is about the warnings in compilation and should be mentioned in a separate issue (thus hide here);
As far as I see it is the smoke_solver_ConvHipImplicitGemmV4R1
test that fails in our CI. The ConvHIPImplicitGemmForwardV4R4XDLops
solver shouldn't be picked in that test.
Maybe you and @JehandadKhan are observing some different problem. Unfortunately the topmost description misses the name of the specific test that fails in that case.
@atamazov sorry for the confusion on a separate issue with this one.
I am running ctest -R smoke_solver_ConvHipImplicitGemmV4R1
and having run to run issue (almost fail about half times). The passing and failing logs are attached
failing_mark.log
passing_mark.log
@junliume Thanks for logs. The instability is due to randomization of tuning configs introduced in #1997. In your logs, this passes:
ConvHipImplicitGemmV4R1Fwd:16,64,8,2,2,2,4,4,4,4,8,2,16,1,4,64
This fails:
ConvHipImplicitGemmV4R1Fwd:16,64,16,2,2,2,4,4,4,4,16,1,16,1,4,64
Note that both logs end with
smoke_solver_ConvHipImplicitGemmV4R1 ...***Failed Error regular expression found in output. Regex=[(FAILED)|(Error)|(failed)]
This is because:
@junliume Some clarification about suspected reason (B) listed at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2038#issuecomment-1481947208, "The validation procedure used in our tests often produces false positives and needs to be improved." The order of computations performed by the kernel under test and by the reference data generator is important and affects RMS. This is especially important for the shortened data types, like FP16. When the computation orders become too different, the RMS may exceed the tolerance limit we have set, even if the kernel under test does all the necessary operations.
There are validation algorithms that do not depend on the order of computations, but it would take a huge amount of work and time to replace the existing verification algorithms (and that's why it wasn't done yet).
So far I recommend the following:
[New Attribution] @junliume @johnny-keker https://github.com/ROCmSoftwarePlatform/MIOpen/labels/bug https://github.com/ROCmSoftwarePlatform/MIOpen/labels/complexity_high https://github.com/ROCmSoftwarePlatform/MIOpen/labels/correctness https://github.com/ROCmSoftwarePlatform/MIOpen/labels/testing https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_normal https://github.com/ROCmSoftwarePlatform/MIOpen/labels/workaround
@atamazov Can we close this issue ?
No, because workaround still exists in our code.
MIOpen develop is failing due to an issue in one of the static implicit GEMM kernels. Steps to reproduce:
Following is the current output
Interestingly the issue has started to appear in our CI since commit:
b4e0a67333ee4bbcbbec1203a0260feff2882cfb
However, I have verified that the issue exists even in prior commits such asf1196f80d251bbeaf0eb6146c7e783fc1c61bd31
All tests done on MI100
This issue is currently blocking new PRs from being merged into develop.