ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

[MIOpenDriver] Print out perf config after MIOpenDriver failed after tuning #1401

Open jerryyin opened 2 years ago

jerryyin commented 2 years ago

This is a debugging ticket brought up by @krzysz00 on the MLIR team.

In using MIOpenDriver to help MLIR tune different configs, it is common for MIOpen to pick a config other than the heuristic one and validate the result. Occasionally, when such a config generates a kernel that malfunctions, this results in a failure in MLIR CI. When that happens, it is hard for a developer who doesn't know much about MIOpen to reproduce the failure.

To improve this situation, we can implement one of the following proposals:

atamazov commented 2 years ago

it is common for MIOpen to pick a config other than the heuristic one and validate the result. Occasionally, when such a config generates a kernel that malfunctions,

MIOpen validates each TuningConfig by means of Solver::IsValidPerformanceConfig(). It is the Solver that must guarantee that a validated TuningConfig produces good kernels, so the root of the problem is a bug in the Solver.

When this happens, it is hard for a developer who doesn't know much about MIOpen to reproduce this failure.

Indeed, handling solver bugs requires knowledge of MIOpen internals ;(

Add an input argument to MIOpenDriver to control the debug print to stdout

The Driver is unaware of tuning, so this won't help. However, exporting MIOPEN_LOG_LEVEL=6 (or even MIOPEN_LOG_LEVEL=5) is often enough to identify the failing TuningConfig.
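As a rough sketch of that workaround (the driver arguments are placeholders, not a verified invocation):

```shell
# Hedged sketch: raise MIOpen's log verbosity before re-running the failing
# case; per the comment above, level 5 or 6 is often enough to spot the
# failing TuningConfig in the log.
export MIOPEN_LOG_LEVEL=6
# ./bin/MIOpenDriver conv <failing arguments from the CI log> 2> miopen.log
echo "MIOPEN_LOG_LEVEL=$MIOPEN_LOG_LEVEL"   # confirm the setting is exported
```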

However, it may be that I misunderstand the problem. Could you describe the use case in more detail? For example, list the exact actions of a developer who occasionally runs into this.

atamazov commented 2 years ago

...it is hard ... to reproduce this failure.

Maybe the key question is: how do you usually reproduce? The answer would help me better understand how we can deliver the necessary info to the user.

What we normally do (when we need to reproduce) is simply re-running the failing test on the local machine (with some instrumentation like logging etc).

krzysz00 commented 2 years ago

How I'd like to reproduce the case where tuning generates a kernel that then has incorrect results:

  1. Look at the failure log for a failing kernel config
  2. Copy out the conv_config parameter that was passed to MLIR
  3. Run `./bin/miopen-gen "--conv-config=[whatever the failing configuration was]" -pv_with_gpu | ./bin/mlir-miopen-driver -c | rocm-run` to confirm the failure (possibly with `-pv` instead of `-pv_with_gpu`, and so on)

What I currently have to do to reproduce a failure that arises after tuning

  1. As before, look at the logs, but now searching for the arguments passed to MIOpenDriver
  2. Re-compile llvm-project-mlir in a static library configuration
  3. Copy the static library out to MIOpen's dependencies directory
  4. Build MIOpen
  5. Run the failing tuning config again with MIOPEN_LOG_LEVEL=6
  6. Get the failing kernel config from a very noisy log
  7. Recompile MLIR into a shared library configuration to enable running internal tests
  8. Run `./bin/miopen-gen "--conv_config=[...]" -pv_with_gpu | ...`

A whole lot of those steps are "toss out the build directory"-type recompiles, so a good several minutes each. Furthermore, since failures that only crop up at tuning are rare, none of this is scripted, so I can't even go have lunch in the middle of the process.
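Steps 5 and 6 of the long path above could in principle be scripted. The sketch below is only an illustration: the driver invocation is commented out, and the log line is a stand-in, not verified MIOpen output; the grep pattern is likewise a guess.

```shell
# Sketch of steps 5-6: re-run the failing config with verbose logging, then
# filter the noisy log down to the tuning-config lines.
# MIOPEN_LOG_LEVEL=6 ./bin/MIOpenDriver conv <failing args> 2> tuning.log
printf 'Info2: unrelated solver noise\nInfo2: PerformanceConfig: 16,16,4,4\n' > tuning.log  # stand-in log, not real MIOpen output
match=$(grep -i 'performanceconfig' tuning.log)   # pattern is a guess
echo "$match"
rm -f tuning.log
```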

atamazov commented 2 years ago

From the reproduction instructions I do not see how MIOpenDriver is involved in this problem (and consequently, how modifying MIOpenDriver can help). The procedure does not even mention the driver, except "...but now searching for the arguments passed to MIOpenDriver". Is it the case that ./bin/mlir-miopen-driver invokes MIOpenDriver?

Hopefully the problem can be resolved at the mlir harness level. If not, then let's book a meeting and discuss.

Please also set the urgency and value labels for this ticket. Thanks!

/cc @junliume

krzysz00 commented 2 years ago

I'm afraid I left out some context. We call MIOpenDriver to test our tuning support and to make sure we can be called from MIOpen. The CI eventually runs https://github.com/ROCmSoftwarePlatform/llvm-project-mlir/blob/miopen-dialect/mlir/utils/jenkins/miopen-tests/miopen_validate.sh, which calls MIOpenDriver.

mlir-miopen-driver is an unrelated tool that runs the code generation pipeline.

atamazov commented 2 years ago

A quick solution could look like this: add an env var to MIOpen that enables logging of PerfConfigs in the MLIR solvers. Use log level 4 (Warning). Logging should happen in GetSolution(). Use that var in your script. Recommended name for the var: MIOPEN_DEBUG_CONV_MLIR_LOG_TUNING_CONFIGS. Please try it and let me know if it works for you.
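In script form, the proposed switch would presumably be used like this. Only the variable name comes from the comment above; the driver arguments are placeholders.

```shell
# Hedged sketch: enable the proposed MLIR perf-config logging, then re-run
# the failing MIOpenDriver invocation from the CI script.
export MIOPEN_DEBUG_CONV_MLIR_LOG_TUNING_CONFIGS=1
# ./bin/MIOpenDriver conv <args from miopen_validate.sh>   # PerfConfigs would now be logged at Warning level
echo "MIOPEN_DEBUG_CONV_MLIR_LOG_TUNING_CONFIGS=$MIOPEN_DEBUG_CONV_MLIR_LOG_TUNING_CONFIGS"
```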

atamazov commented 2 years ago

@jerryyin

This is a usability ticket

I think this is about debugging.

jerryyin commented 2 years ago

The Driver is unaware of tuning, so this won't help. However exporting MIOPEN_LOG_LEVEL=6 (or even MIOPEN_LOG_LEVEL=5) is often enough to identify the failing TuningConfig.

Hmmm, that's right. I just realized that perf_config is an MIOpen-internal thing, and that MIOpenDriver only calls the MIOpen API like a standalone application.

I think this is about debugging.

Updated the ticket description.

...add to MIOpen an env var that enables logging of PerfConfigs in MLIR solvers...

This seems like a reasonable way to do it. Once it is done, I believe the behavior will be:

Since this isn't the highest-priority thing, I am placing it on my back burner for now.

ppanchad-amd commented 5 months ago

@jerryyin Has this been resolved with latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!