Closed: cderb closed this 3 months ago
@CAHEK7 GenericSearch is randomizing performance config order for the solver: https://github.com/ROCm/MIOpen/blob/b6e2e7d4342a9d3b44e307dac61562eee8a2070a/src/include/miopen/generic_search.hpp#L393-L398

So each call to generic search would look like:
Alg0
tune2
tune0
tune1
Would you propose a change to GenericSearch where the solvers are evaluated/tested in an interleaved fashion? This would require something like a refactoring of the function to take a list of solvers instead.
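A round-robin merge over the solvers' (already shuffled) candidate lists is one way to get that interleaving. A minimal sketch, assuming hypothetical `Solver`/`PerfConfig` types rather than MIOpen's actual ones:

```cpp
#include <cstddef>
#include <vector>

struct PerfConfig { /* solver-specific tuning parameters */ };

struct Solver
{
    std::vector<PerfConfig> candidates; // per-solver list, already shuffled
};

// Round-robin over the solvers so that an early stop (e.g. via
// MIOPEN_TUNING_PATIENCE) has still sampled every solver, not just the first.
std::vector<PerfConfig> InterleaveCandidates(const std::vector<Solver>& solvers)
{
    std::vector<PerfConfig> order;
    for(std::size_t i = 0;; ++i)
    {
        bool took_any = false;
        for(const auto& s : solvers)
        {
            if(i < s.candidates.size())
            {
                order.push_back(s.candidates[i]);
                took_any = true;
            }
        }
        if(!took_any)
            break; // all candidate lists exhausted
    }
    return order;
}
```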
@cderb What is the difference between MIOPEN_DEBUG_TUNING_ITERATIONS_MAX and newly added MIOPEN_TUNING_PATIENCE?
> @CAHEK7 GenericSearch is randomizing performance config order for the solver. So each call to generic search would look like:
>
> Alg0 tune2 tune0 tune1
>
> Would you propose a change to GenericSearch where the solvers are evaluated/tested in an interleaved fashion? This would require something like a refactoring of the function to take a list of solvers instead.
Random shuffle inside one algorithm is better than no shuffle at all, but if the algorithm has A LOT of tuning cases and is very stable for that particular problem, then we may stick with this algorithm. But probably that's the intention: to stop tuning once we've found such an algorithm.
> @cderb What is the difference between MIOPEN_DEBUG_TUNING_ITERATIONS_MAX and newly added MIOPEN_TUNING_PATIENCE?
MIOPEN_DEBUG_TUNING_ITERATIONS_MAX is a hard cap on the total number of tuning iterations; MIOPEN_TUNING_PATIENCE is a cap on the number of consecutive iterations without improvement. If performance improves, the termination count resets.
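Roughly, the two limits interact like this; a minimal sketch, assuming a hypothetical Benchmark() helper rather than the real GenericSearch loop:

```cpp
#include <cstddef>
#include <limits>

// Hypothetical stand-in for measuring one candidate's kernel time.
double Benchmark(std::size_t candidate) { return 1.0 / (1.0 + candidate % 7); }

double TuneLoop(std::size_t num_candidates,
                std::size_t iterations_max, // MIOPEN_DEBUG_TUNING_ITERATIONS_MAX: hard cap
                std::size_t patience)       // MIOPEN_TUNING_PATIENCE: cap without improvement
{
    double best = std::numeric_limits<double>::infinity();
    std::size_t since_improvement = 0;
    for(std::size_t i = 0; i < num_candidates && i < iterations_max; ++i)
    {
        const double t = Benchmark(i);
        if(t < best)
        {
            best = t;
            since_improvement = 0; // improvement resets the patience counter
        }
        else if(++since_improvement >= patience)
        {
            break; // too many consecutive iterations without improvement
        }
    }
    return best;
}
```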
> Random shuffle inside one algorithm is better than no shuffle at all, but if the algorithm has A LOT of tuning cases and is very stable for that particular problem, then we may stick with this algorithm. But probably that's the intention: to stop tuning once we've found such an algorithm.
This shuffle is meant to break up the configs within the algorithm and to facilitate random sampling of that algorithm's tuning space while it is being tuned. It reduces the spatial proximity of similar configs, so if an env var like MIOPEN_TUNING_PATIENCE or MIOPEN_DEBUG_TUNING_ITERATIONS_MAX is set, a wider range of configs is likely to be sampled.
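In other words, a truncated walk over the shuffled sequence approximates a uniform sample of the whole tuning space, whereas a truncated walk over the enumeration order only sees a cluster of near-identical neighbors. A toy sketch (hypothetical helper, not the code at the link above):

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Shuffle first, then truncate to the iteration budget: the surviving prefix
// is spread across the whole config space instead of one contiguous cluster.
std::vector<int> SampleConfigs(std::vector<int> configs, std::size_t budget)
{
    std::mt19937 rng(std::random_device{}());
    std::shuffle(configs.begin(), configs.end(), rng);
    if(configs.size() > budget)
        configs.resize(budget);
    return configs;
}
```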
@cderb @CAHEK7 @averinevg @junliume ~~Unfortunately, the effectiveness of MIOPEN_TUNING_PATIENCE depends on the distribution of fast PerfConfigs within the virtual container (which, in turn, may depend on the Problem).~~

The potential issue is using the number of iterations as a limit. It puts ASM kernels at a disadvantage compared to OCL kernels, and OCL kernels at a disadvantage compared to HIP kernels. For example, building 100 ASM kernels would take ~5 sec, while building 100 HIP kernels may take 5 minutes or more. A number-of-iterations limit that works for HIP will not do any good for ASM but may unnecessarily affect ASM performance.

[Notice] That's why MIOPEN_DEBUG_TUNING_ITERATIONS_MAX, as its name suggests, is intended for debugging/testing purposes only.

[Recommendation] Rename MIOPEN_TUNING_PATIENCE to MIOPEN_TUNING_PATIENCE_ITERATIONS_MAX, or, better, replace it with MIOPEN_TUNING_PATIENCE_TIME_MS_MAX.
To me, the most promising approach is:

- Using randomly reordered PerfConfigs at the generic level together with MIOPEN_TUNING_TIME_MS_MAX.
- We can also implement MIOPEN_TUNING_PATIENCE_TIME_MS_MAX and try different combinations.
...and AFAICS we already have the first item.
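For illustration, a sketch of how the proposed time-based limits could behave; the env var names here are the suggestions above, not variables MIOpen currently implements, and Benchmark() is a hypothetical helper:

```cpp
#include <chrono>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for measuring one candidate's build-plus-run time.
double Benchmark(std::size_t candidate) { return 1.0 / (1.0 + candidate % 7); }

double TuneWithTimeBudget(std::size_t num_candidates,
                          std::chrono::milliseconds total_budget,    // proposed MIOPEN_TUNING_TIME_MS_MAX
                          std::chrono::milliseconds patience_budget) // proposed MIOPEN_TUNING_PATIENCE_TIME_MS_MAX
{
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    auto last_improvement = start;
    double best = std::numeric_limits<double>::infinity();
    for(std::size_t i = 0; i < num_candidates; ++i)
    {
        const auto now = Clock::now();
        // Wall-clock limits treat fast-to-build ASM kernels and slow-to-build
        // HIP kernels fairly, unlike a fixed iteration count.
        if(now - start >= total_budget || now - last_improvement >= patience_budget)
            break;
        const double t = Benchmark(i);
        if(t < best)
        {
            best = t;
            last_improvement = Clock::now();
        }
    }
    return best;
}
```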
Adds the environment variable MIOPEN_TUNING_PATIENCE, which allows the user to set the maximum number of performance configurations GenericSearch will iterate through without improvement before quitting.