ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

GenericSearch Patience #3215

Closed cderb closed 3 months ago

cderb commented 3 months ago

Adds environment variable MIOPEN_TUNING_PATIENCE which will allow the user to set the maximum number of performance configurations GenericSearch will iterate through without improvement before quitting.
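For illustration, a minimal sketch of a patience-style stopping rule of the kind described (not the actual MIOpen implementation; `TuneWithPatience`, `measure`, and the config indices are hypothetical stand-ins):

```cpp
// Minimal sketch of a patience-style stopping rule (illustrative only, not MIOpen code).
// `patience` plays the role of MIOPEN_TUNING_PATIENCE; `measure` is a hypothetical
// stand-in for benchmarking one performance config and returning its kernel time in ms.
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

std::size_t TuneWithPatience(const std::vector<std::size_t>& configs,
                             std::size_t patience,
                             const std::function<double(std::size_t)>& measure)
{
    double bestTimeMs         = std::numeric_limits<double>::max();
    std::size_t bestConfig    = configs.empty() ? 0 : configs.front();
    std::size_t sinceImproved = 0;

    for(const auto cfg : configs)
    {
        const double t = measure(cfg);
        if(t < bestTimeMs)
        {
            bestTimeMs    = t;
            bestConfig    = cfg;
            sinceImproved = 0; // any improvement resets the counter
        }
        else if(++sinceImproved >= patience)
        {
            break; // too many configs in a row without improvement: give up early
        }
    }
    return bestConfig;
}
```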

cderb commented 3 months ago

@CAHEK7 GenericSearch is randomizing performance config order for the solver: https://github.com/ROCm/MIOpen/blob/b6e2e7d4342a9d3b44e307dac61562eee8a2070a/src/include/miopen/generic_search.hpp#L393-L398 So each call to generic search would look like:

Alg0
  tune2
  tune0
  tune1

Would you propose a change to GenericSearch where the solvers are evaluated/tested in an interleaved fashion? This would require reworking the function to take a list of solvers instead.
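For illustration, a rough sketch of what interleaved, round-robin evaluation across a list of solvers could look like (all names hypothetical; not a proposal for the actual GenericSearch signature):

```cpp
// Rough sketch of interleaved (round-robin) evaluation across several solvers,
// instead of exhausting one solver's configs before moving to the next.
// All names are hypothetical; this is not MIOpen's GenericSearch.
#include <cstddef>
#include <utility>
#include <vector>

// Returns the visit order as (solver index, config index) pairs.
std::vector<std::pair<std::size_t, std::size_t>>
InterleavedOrder(const std::vector<std::vector<std::size_t>>& configsPerSolver)
{
    std::vector<std::pair<std::size_t, std::size_t>> order;
    for(std::size_t round = 0;; ++round)
    {
        bool any = false;
        for(std::size_t s = 0; s < configsPerSolver.size(); ++s)
        {
            if(round < configsPerSolver[s].size())
            {
                order.emplace_back(s, configsPerSolver[s][round]); // one config per solver per round
                any = true;
            }
        }
        if(!any)
            break; // every solver's config list is exhausted
    }
    return order;
}
```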

averinevg commented 3 months ago

@cderb What is the difference between MIOPEN_DEBUG_TUNING_ITERATIONS_MAX and newly added MIOPEN_TUNING_PATIENCE?

CAHEK7 commented 3 months ago

> @CAHEK7 GenericSearch is randomizing performance config order for the solver:
>
> https://github.com/ROCm/MIOpen/blob/b6e2e7d4342a9d3b44e307dac61562eee8a2070a/src/include/miopen/generic_search.hpp#L393-L398
>
> So each call to generic search would look like:
>
>     Alg0
>       tune2
>       tune0
>       tune1
>
> Would you propose a change to GenericSearch where the solvers are evaluated/tested in an interleaved fashion? This would require reworking the function to take a list of solvers instead.

Random shuffle inside one algorithm is better than no shuffle at all, but if the algorithm has A LOT of tuning cases and is very stable for that particular problem, then we may get stuck with this algorithm. But perhaps that is the intention: to stop tuning once we have found such an algorithm.

cderb commented 3 months ago

> @cderb What is the difference between MIOPEN_DEBUG_TUNING_ITERATIONS_MAX and newly added MIOPEN_TUNING_PATIENCE?

MIOPEN_DEBUG_TUNING_ITERATIONS_MAX is a hard cap on the total number of tuning iterations; MIOPEN_TUNING_PATIENCE is a cap on the number of iterations without improvement. If performance improves, the termination count resets.
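For illustration, the two limits expressed side by side (a sketch, not the real implementation): the hard cap counts every iteration, while the patience counter counts only iterations since the last improvement and is reset whenever a better config is found.

```cpp
// Illustrative sketch of the two limits together (not MIOpen code):
// `iterationsMax` mirrors MIOPEN_DEBUG_TUNING_ITERATIONS_MAX (hard cap on all iterations),
// `patience` mirrors MIOPEN_TUNING_PATIENCE (cap on iterations since the last improvement).
#include <cstddef>

bool KeepTuning(std::size_t iterationsDone,
                std::size_t iterationsSinceImprovement,
                std::size_t iterationsMax,
                std::size_t patience)
{
    if(iterationsDone >= iterationsMax)
        return false; // hard cap reached, regardless of progress
    if(iterationsSinceImprovement >= patience)
        return false; // no improvement for `patience` iterations in a row
    return true;
}
```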

cderb commented 3 months ago

> Random shuffle inside one algorithm is better than no shuffle at all, but if the algorithm has A LOT of tuning cases and is very stable for that particular problem, then we may get stuck with this algorithm. But perhaps that is the intention: to stop tuning once we have found such an algorithm.

This shuffle is meant to break up the configs within the algorithm and to facilitate random sampling of that algorithm while it is being tuned. It reduces the spatial proximity of similar configs, so if an env var like MIOPEN_TUNING_PATIENCE or MIOPEN_DEBUG_TUNING_ITERATIONS_MAX is set, it is more likely that a wider range of configs is sampled.
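For illustration, the kind of index shuffle being described (a sketch only; the actual logic is in the generic_search.hpp lines linked above):

```cpp
// Sketch of shuffling the enumeration order of performance configs so that
// neighbouring (similar) configs are spread out before sampling begins.
// Illustrative only; the real logic lives in MIOpen's generic_search.hpp.
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

std::vector<std::size_t> ShuffledConfigOrder(std::size_t numConfigs)
{
    std::vector<std::size_t> order(numConfigs);
    std::iota(order.begin(), order.end(), std::size_t{0}); // 0, 1, ..., numConfigs-1
    std::mt19937 rng{std::random_device{}()};
    std::shuffle(order.begin(), order.end(), rng); // break up spatial proximity of similar configs
    return order;
}
```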

atamazov commented 3 months ago

@cderb @CAHEK7 @averinevg @junliume ~Unfortunately, the effectiveness of MIOPEN_TUNING_PATIENCE depends on distribution of fast PerfConfigs within virtual container (which, in turn, may depend on the Problem).~

The potential issue is using the number of iterations as a limit. It puts ASM kernels at a disadvantage compared to OCL kernels, and OCL kernels at a disadvantage compared to HIP kernels. For example, building 100 ASM kernels takes ~5 seconds, while building 100 HIP kernels may take 5 minutes or more. A number-of-iterations limit that works for HIP does no good for ASM and may unnecessarily hurt ASM performance.

[Notice] That's why MIOPEN_DEBUG_TUNING_ITERATIONS_MAX, as its name suggests, is intended for debugging/testing purposes only.

[Recommendation] Rename MIOPEN_TUNING_PATIENCE to MIOPEN_TUNING_PATIENCE_ITERATIONS_MAX, or, better, replace it with MIOPEN_TUNING_PATIENCE_TIME_MS_MAX.

To me, the most promising approach is:

  • Using randomly reordered PerfConfigs at the generic level together with MIOPEN_TUNING_TIME_MS_MAX.
  • We can also implement MIOPEN_TUNING_PATIENCE_TIME_MS_MAX and try different combinations.

atamazov commented 3 months ago

> To me, the most promising approach is:
>
>   • Using randomly reordered PerfConfigs at the generic level together with MIOPEN_TUNING_TIME_MS_MAX.
>   • We can also implement MIOPEN_TUNING_PATIENCE_TIME_MS_MAX and try different combinations.

...and AFAICS we already have the first item.
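For illustration, a sketch of how a wall-clock budget and a time-based patience window could be combined; MIOPEN_TUNING_TIME_MS_MAX and MIOPEN_TUNING_PATIENCE_TIME_MS_MAX are the names proposed in this thread, not existing variables:

```cpp
// Sketch of a tuning loop limited by wall-clock time rather than iteration count
// (illustrative only, not MIOpen code). `totalBudgetMs` corresponds to the proposed
// MIOPEN_TUNING_TIME_MS_MAX and `patienceMs` to the proposed
// MIOPEN_TUNING_PATIENCE_TIME_MS_MAX; both are assumptions taken from this discussion.
#include <chrono>

class TimeBudget
{
    using Clock = std::chrono::steady_clock;
    Clock::time_point start_       = Clock::now();
    Clock::time_point lastImprove_ = start_;
    double totalBudgetMs_;
    double patienceMs_;

public:
    TimeBudget(double totalBudgetMs, double patienceMs)
        : totalBudgetMs_(totalBudgetMs), patienceMs_(patienceMs)
    {
    }

    void RecordImprovement() { lastImprove_ = Clock::now(); } // resets the patience window

    bool KeepTuning() const
    {
        const auto now = Clock::now();
        const auto ms  = [](auto d) {
            return std::chrono::duration<double, std::milli>(d).count();
        };
        if(ms(now - start_) >= totalBudgetMs_)
            return false; // overall wall-clock budget exhausted
        if(ms(now - lastImprove_) >= patienceMs_)
            return false; // too long since the last improvement
        return true;
    }
};
```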