Looking at running various models with various inputs, it seems a lot of the time in initial runs is spent benchmarking potential kernels, including the naive ones (e.g. `naive_conv_nonpacked_fwd_nchw_float_double_float`).
The solver that ends up being selected is usually not the naive one, but one of the other kernels. Running with `MIOPEN_DEBUG_CONV_DIRECT=0` significantly speeds up initial runs of these models with varying resolutions.
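For reference, a minimal sketch of how the workaround is applied from a Python workload. The variable has to be in the environment before MIOpen initializes, so it must be set before importing the framework; the `torch` import is a placeholder for whatever launches the model.

```python
import os

# Disable MIOpen's direct (naive) convolution solvers for this process.
# This must happen before the library that loads MIOpen is imported.
os.environ["MIOPEN_DEBUG_CONV_DIRECT"] = "0"

# import torch  # placeholder: the VAE workload would follow here
print(os.environ["MIOPEN_DEBUG_CONV_DIRECT"])
```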
Would it be an option to handle this testing/benchmarking dynamically, without excluding the naive kernels completely? The naive kernel would be the least preferred: if any other applicable kernel is found, it is a safe bet that the other implementation is faster, so benchmarking the naive kernel itself could be skipped altogether.
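The proposed selection logic could be sketched like this. This is a hypothetical illustration, not MIOpen's actual solver API; the `pick_solver` function, the `benchmark` callback, and detecting naive kernels by the `naive_` name prefix are all assumptions for the sake of the example.

```python
def pick_solver(applicable_solvers, benchmark):
    """Benchmark only the non-naive solvers; fall back to the naive
    kernel (without timing it) when nothing else applies."""
    non_naive = [s for s in applicable_solvers if not s.startswith("naive_")]
    if non_naive:
        # Assumption: any tuned solver beats the naive reference kernel,
        # so the naive kernel is never benchmarked.
        return min(non_naive, key=benchmark)
    # No alternative found: the naive kernel is the safe fallback.
    naive = [s for s in applicable_solvers if s.startswith("naive_")]
    return naive[0] if naive else None
```

With this shape, the naive kernels stay available for shapes that no other solver supports, but they never cost benchmarking time when an alternative exists.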
If this is not desired default behaviour, maybe it could be added behind a feature flag.
I'm quite sure that people running this without knowing about the workaround would see major speedups in initial runs (the test case here is various VAE models being run).