ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/
Other
1.06k stars 221 forks source link

Mismatch in `ConvHipImplicitGemmV4R1Fwd` #2038

Open JehandadKhan opened 1 year ago

JehandadKhan commented 1 year ago

MIOpen develop is failing due to an issue in one of the static implicit GEMM kernels. Steps to reproduce:

# clone MIOpen develop
mkdir build; cd build
CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS='-Werror'  cmake -DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' -DCMAKE_BUILD_TYPE=release -DBUILD_DEV=On -DMIOPEN_USE_MLIR=ON -DMIOPEN_GPU_SYNC=Off  -DMIOPEN_TEST_BFLOAT16=On -DCMAKE_PREFIX_PATH=/opt/rocm    ..
make -j 32 test_conv2d
 ./bin/test_conv2d --bfloat16 --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 256 32 27 27 --weights 128 32 1 1 --batch_size 256 --input_channels 32 --output_channels 128 --spatial_dim_elements 27 27 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --deterministic 0 --tensor_vect 0 --vector_length 1  --int8_vectorize 0

Following is the current output

./bin/test_conv2d --bfloat16 --cmode conv --pmode default --group-count 1 --disable-backward-data --disable-backward-weights --input 256 32 27 27 --weights 128 32 1 1 --batch_size 256 --input_channels 32 --output_channels 128 --spatial_dim_elements 27 27 --filter_dims 1 1 --pads_strides_dilations 0 0 1 1 1 1 --trans_output_pads 0 0 --in_layout NCHW --fil_layout NCHW --out_layout NCHW --deterministic 0 --tensor_vect 0 --vector_length 1 --output_type int32 --int8_vectorize 0 
FAILED: 0.740628
Iteration: 0
Forward convolution: ConvHipImplicitGemmV4R1Fwd
Input tensor: 256, 32, 27, 27
Weights tensor: 128, 32, 1, 1
Output tensor: 256, 128, 27, 27
Filter: conv2d, miopenConvolution, miopenPaddingDefault, {0, 0}, {1, 1}, {1, 1}, 
Max diff: 1.32204e+10
Mismatch at 0: -126 != 4.42919e+09

Interestingly the issue has started to appear in our CI since commit: b4e0a67333ee4bbcbbec1203a0260feff2882cfb However, I have verified that the issue exists even in prior commits such as f1196f80d251bbeaf0eb6146c7e783fc1c61bd31

All tests done on MI100

This issue is currently blocking new PRs from being merged into develop.

junliume commented 1 year ago

@carlushuang could you help to take a look?

carlushuang commented 1 year ago

This is a non-xdlops kernel targeting NCHW layout, but I can't reproduce this failure on an MI200 machine. Is it tested on MI100 or Vega? image

junliume commented 1 year ago

@JehandadKhan could you clarify on the reproducing steps and env?

JehandadKhan commented 1 year ago

@carlushuang I tested on MI100 system.

carlushuang commented 1 year ago

@JehandadKhan this solver is targeting non-xdlops kernels, so performance will be not good. For MI100/MI200 there are alternative solvers like ConvHIPImplicitGemmForwardV4R4XDLops, and other asm solvers that support this case. So if this one has computation bug I think we have to disable it. What do you think?

cc @zjing14

junliume commented 1 year ago

@carlushuang @JehandadKhan : our CI is failing consistently on the following http://micimaster.amd.com/blue/organizations/jenkins/MLLibs%2FMIOpen/detail/develop/964/pipeline

atamazov commented 1 year ago

@carlushuang @junliume I do not remember BF16 precision problems with this solver. We need to find the root reason of the issue before trying to fix of workaround the issue. It could be, for example, a bug in the compiler. What is ROCm version?

So if this one has computation bug I think we have to disable it.

Please look at #936. Maybe this is a kind of verification bug.

atamazov commented 1 year ago

@JehandadKhan Is it so that the smoke_solver_ConvHipImplicitGemmV4R1 test is failing with BF16? This is interesting... What has changed? Maybe tuning?

I recommend renaming this ticket to "smoke_solver_ConvHipImplicitGemmV4R1 test is failing with BF16".

atamazov commented 1 year ago

@carlushuang @JehandadKhan @junliume

...For MI100/MI200 there are alternative solvers like ConvHIPImplicitGemmForwardV4R4XDLops, and other asm solvers that support this case. So if this one has computation bug I think we have to disable it. What do you think?

As we can see from #936, we have verification problems with this solver for a long time. The solver is originated from https://github.com/AMDComputeLibraries/MLOpen/pull/2132, and it seems that nobody has time to maintain it.

Therefore I agree with @carlushuang and would vote for disabling/removing ConvHipImplicitGemmV4R1Fwd, but we need to make sure that performance remains at the same level.

🟡 For now, I will prepare a W/A that disables ConvHipImplicitGemmV4R1Fwd for BF16 on xDLOPs targets.

/CC @asroy

atamazov commented 1 year ago

@carlushuang

...I can't reproduce this failure on an MI200 machine...

The solver is applicable for MI200 (please check ConvHipImplicitGemmV4R1Fwd::IsApplicable() to see). Maybe you have some environment setting that prevents this solver from running.

atamazov commented 1 year ago

Now I see the logs and know the symptom and can explain the root reason.

Symptom:

The smoke_solver_ConvHipImplicitGemmV4R1 fails. The test checks tuning, among other things. It performs only 5 rounds (to save time) of tuning and then uses the resulting tuning config (PerformanceConfig) to run the solver. Then the output fails validation.

The reason of failure

Specifically, the kernel produced during the smoke_solver_ConvHipImplicitGemmV4R1 test accidentally turned out to be the one that does not pass the correctness check.

The root reason

I think it matches the root reason of #936. It could be one of these two:

atamazov commented 1 year ago

🟡 According to the analysis above, it is highly likely that #2041 won't unblock the CI. I am going to prepare another W/A that disables tuning for ConvHipImplicitGemmV4R1Fwd during its smoke test.

junliume commented 1 year ago

@atamazov it seems that a workaround in https://github.com/ROCmSoftwarePlatform/MIOpen/commit/277e35c10f036833f781477f3db7bcb7ad52906d can actually let the unit test pass. I am still puzzled why this issue happens now, i.e. after tuning limitation is implemented.

[Reproduce the Issue]: Pull rocm/miopen:ci docker, and build MIOpen develop at e0179c0, then

make smoke_solver_ConvHipImplicitGemmV4R1 -j$(nproc) and ctest -R smoke_solver_ConvHipImplicitGemmV4R1

MIOpen(HIP): Info [PrintVersion] HIPRTC v.9.0
MIOpen(HIP): Error [Compile] 'hiprtcCompileProgram(prog.get(), c_options.size(), c_options.data())' static_kernel_gridwise_convolution_forward_implicit_gemm_v4r4_xdlops_nchw_kcyx_nkhw.cpp: HIPRTC_ERROR_COMPILATION (6)
MIOpen(HIP): Error [BuildHip] HIPRTC status = HIPRTC_ERROR_COMPILATION (6), source file: static_kernel_gridwise_convolution_forward_implicit_gemm_v4r4_xdlops_nchw_kcyx_nkhw.cpp
MIOpen(HIP): Warning [BuildHip] In file included from <built-in>:1:
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:939:49: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
      __hip_assert(false && "invalid tile size");
                                                ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:986:49: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
      __hip_assert(false && "invalid tile size");
                                                ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1192:62: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
      __hip_assert(false && "invalid cooperative group type");
                                                             ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1216:62: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
      __hip_assert(false && "invalid cooperative group type");
                                                             ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1245:62: error: empty expression statement has no effect; remove unnecessary ';' to silence this warning [-Werror,-Wextra-semi-stmt]
      __hip_assert(false && "invalid cooperative group type");
                                                             ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1380:62: error: unused parameter 'g' [-Werror,-Wunused-parameter]
      const thread_block_tile_internal<tbtSize, tbtParentT>& g)
                                                             ^
/long_pathname_so_that_rpms_can_package_the_debug_info/src/out/ubuntu-20.04/20.04/build/hip-on-rocclr/src/hiprtc/hip_rtc_gen/hipRTC_header.h:1383:67: error: unused parameter 'g' [-Werror,-Wunused-parameter]
  __CG_QUALIFIER__ thread_block_tile_internal(const thread_block& g)
                                                                  ^
7 errors generated when compiling for gfx90a.

These warning messages apparently come from hipRTC/COMgr, but why it was not a problem before #1997 ?

atamazov commented 1 year ago

@junliume ~Because #1997 reorders tuning configs in the container in some random order. Now the first 5 configs are different.~

atamazov commented 1 year ago

@junliume Oh, no this is totally different issue. Let's discuss it separately.

junliume commented 1 year ago

@junliume Because #1997 reorders tuning configs in the container in some random order. Now the first 5 configs are different.

@atamazov Okay. Thanks! So when MIOpen failed to compile ConvHIPImplicitGemmForwardV4R4XDLops it moves on to ConvHipImplicitGemmV4R1Fwd and then have some numerical issues? BTW~ could you review if https://github.com/ROCmSoftwarePlatform/MIOpen/commit/277e35c10f036833f781477f3db7bcb7ad52906d is the right place to suppress these hipRTC warnings? Or there might be other places?

atamazov commented 1 year ago

@junliume This is different issue. Let's hide the comments about warning to avoid messing things.

atamazov commented 1 year ago

@junliume What I see is:

[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GetAllConfigs] ConvHipImplicitGemmV4R1Fwd: Searching the best solution among 33...
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 0/0/5 0.23334, best within recent 1: 0.23334 #0 16,32,8,2,2,2,4,4,2,4,8,1,16,1,4,32, ETA:0 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 1/0/5 0.23334, best within recent 1: 0.750709 #1 16,32,4,2,4,4,2,4,2,4,4,1,16,1,4,16, ETA:53.7877 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 2/0/5 0.23334, best within recent 1: 0.285917 #2 16,128,4,2,4,4,4,4,4,4,4,2,16,2,2,128, ETA:28.3322 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [Monitor] 3/0/5 0.114943, best within recent 1: 0.114943 #3 16,128,16,2,4,4,4,4,4,4,16,1,16,1,2,128, ETA:15.7516 sec.
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GenericSearch] Done: 5/0/5, best #3 0.114943 16,128,16,2,4,4,4,4,4,4,16,1,16,1,2,128
[2023-03-23T18:53:48.490Z] MIOpen(HIP): Warning [GenericSearch] ...Score: 2.44989 (default time 0.281597)
[2023-03-23T18:53:48.490Z] FAILED: 0.535381
[2023-03-23T18:53:48.490Z] Max diff: 255
[2023-03-23T18:53:48.490Z] Mismatch at 0: 30 != 129

No build warnings, just validation error.

junliume commented 1 year ago

@atamazov indeed it could be two separate issues: from https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2038#issuecomment-1481975124

atamazov commented 1 year ago

why ConvHIPImplicitGemmForwardV4R4XDLops is not picked when it should. This is about the warnings in compilation and should be mentioned in a separate issue (thus hide here);

As far as I see it is the smoke_solver_ConvHipImplicitGemmV4R1 test that fails in our CI. The ConvHIPImplicitGemmForwardV4R4XDLops solver shouldn't be picked in that test.

Maybe you and @JehandadKhan are observing some different problem. Unfortunately the topmost description misses the name of the specific test that fails in that case.

junliume commented 1 year ago

@atamazov sorry for the confusion on a separate issue with this one. I am running ctest -R smoke_solver_ConvHipImplicitGemmV4R1 and having run to run issue (almost fail about half times). The passing and failing logs are attached failing_mark.log passing_mark.log

atamazov commented 1 year ago

@junliume Thanks for logs. The instability is due to randomization of tuning configs introduced in #1997. In your logs, this passes:

ConvHipImplicitGemmV4R1Fwd:16,64,8,2,2,2,4,4,4,4,8,2,16,1,4,64

This fails:

ConvHipImplicitGemmV4R1Fwd:16,64,16,2,2,2,4,4,4,4,16,1,16,1,4,64

Note that both logs end with

smoke_solver_ConvHipImplicitGemmV4R1 ...***Failed  Error regular expression found in output. Regex=[(FAILED)|(Error)|(failed)]

This is because:

atamazov commented 1 year ago

@junliume Some clarification about suspected reason (B) listed at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/2038#issuecomment-1481947208, "The validation procedure used in our tests often produces false positives and needs to be improved." The order of computations performed by the kernel under test and by the reference data generator is important and affects RMS. This is especially important for the shortened data types, like FP16. When the computation orders become too different, the RMS may exceed the tolerance limit we have set, even if the kernel under test does all the necessary operations.

There are validation algorithms that do not depend on the order of computations, but it would take a huge amount of work and time to replace the existing verification algorithms (and that's why it wasn't done yet).

So far I recommend the following:

atamazov commented 1 year ago

2043 is merged now.


[New Attribution] @junliume @johnny-keker https://github.com/ROCmSoftwarePlatform/MIOpen/labels/bug https://github.com/ROCmSoftwarePlatform/MIOpen/labels/complexity_high https://github.com/ROCmSoftwarePlatform/MIOpen/labels/correctness https://github.com/ROCmSoftwarePlatform/MIOpen/labels/testing https://github.com/ROCmSoftwarePlatform/MIOpen/labels/urgency_normal https://github.com/ROCmSoftwarePlatform/MIOpen/labels/workaround

JehandadKhan commented 1 year ago

@atamazov Can we close this issue ?

atamazov commented 1 year ago

No, because workaround still exists in our code.