ROCm / AMDMIGraphX

AMD's graph optimization engine.
https://rocm.docs.amd.com/projects/AMDMIGraphX/en/latest/
MIT License

Autotuning (initial) support #862

Closed hgaspar closed 9 months ago

hgaspar commented 3 years ago

Autotuning refers to the ability to turn various pipeline configuration knobs and find the combination of settings that produces the fastest execution of the pipeline.

Such "knobs" include (this list is not exhaustive):

  1. datatype precisions, including mixed precision pipelines (e.g. start with fp16, then quantize to int8, and then further down quantize to int4).
  2. Layer data layouts (e.g. NCHW, NHWC, but also NCHWc), including per layer layouts.
  3. Fusions.
  4. Grid launch params and other params (e.g. LDS size) for kernels implemented in migraphX.

The various configuration knobs can be set from a variety of inputs:

  1. User input, to enable/disable various tuning dimensions, or explicitly set params.
  2. Heuristics (e.g. always fuse activation to convolution when available).
  3. Benchmarking, where a set of configuration parameters is benchmarked, and the best combination is chosen. This benchmarking can either be exhaustive, or algorithmically driven, via constrained optimization techniques.
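The benchmarking-driven case (3), in its exhaustive form, amounts to timing every combination of knob values and keeping the fastest. A minimal sketch of that loop is below; the knob names, the `benchmark` helper, and the `run_pipeline` callback are all hypothetical placeholders, not MIGraphX APIs:

```python
import itertools
import time

# Hypothetical tuning knobs; these are not actual MIGraphX option names.
KNOBS = {
    "datatype": ["fp16", "int8"],
    "layout": ["NCHW", "NHWC"],
}

def benchmark(config, run_pipeline, iters=10):
    """Average wall-clock time of running the pipeline with these knob settings."""
    start = time.perf_counter()
    for _ in range(iters):
        run_pipeline(config)
    return (time.perf_counter() - start) / iters

def exhaustive_autotune(run_pipeline):
    """Try every combination of knob values; return the fastest config and its time."""
    best_config, best_time = None, float("inf")
    for values in itertools.product(*KNOBS.values()):
        config = dict(zip(KNOBS.keys(), values))
        elapsed = benchmark(config, run_pipeline)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time
```

An algorithmically driven search would replace the `itertools.product` enumeration with a constrained-optimization strategy that prunes the grid.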

Initial goal: implement a first baby step to get our feet wet with autotuning, which can eventually be generalized in various ways.

Assumptions:

  1. No mixed-precision pipelines and no mixed layouts in the initial version, so the tuning search space is very small. This initial autotuning implementation is really just a warmup step for fancier things to follow later. The main problems to solve correctly in this initial version are: a. how to drive the process of turning the tuning knobs correctly, and b. how to write and read tuning JSON files.
  2. No need to formalize benchmarking in a perf DB.

Proposal:

  1. Consider real-time resnet50 inference as the guinea pig for autotuning. Have MIOpen support resnet50 out of the box in the following forms: int8/fp16 (2 options) and NCHW/NHWC (2 options); the ETA for those MIOpen features is ROCm 4.5. Each layer will therefore have 2x2=4 correctly tuned configurations (for ImageNet) in the MIOpen db.
  2. The MIGraphX driver will support arguments to set datatypes and layouts globally (not per layer), as follows:
     a. Set an explicit datatype (fp16 or int8) globally, for the whole pipeline, or a wildcard option like “pick_global” that will try all options (again assuming that datatypes don’t change per layer). Obviously int8 (quantization) should always win when datatypes are defined globally.
     b. Set a layout, among the layouts above, globally, for the whole pipeline, explicitly, or a wildcard option like “pick_global” that will try all layouts and find the best global (i.e. not layer-dependent) layout.
     c. Any option specified as “pick_global” enables autotuning. All enabled options will be tried, so for example if pick_global is turned on for both datatypes and layouts, then 4 possibilities will be tried (in the example above).
     d. For all enabled autotuning options, a JSON file will be generated that holds runtime info per tuning config, for all layers and for the whole pipeline, and specifically tags the winning tuning config. That JSON file can also be used as a command-line argument at runtime to circumvent the tuning process and just set the correct knobs.
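The tuning JSON described in (d) might be written and read back along these lines. The schema, field names, and timing numbers below are purely illustrative assumptions, not the format MIGraphX actually emits:

```python
import json

# Illustrative tuning-result schema (hypothetical; not MIGraphX's actual format).
# Each entry records per-layer and whole-pipeline runtimes for one knob combination.
results = {
    "configs": [
        {"datatype": "fp16", "layout": "NCHW",
         "per_layer_ms": {"conv1": 0.41, "fc1000": 0.08}, "pipeline_ms": 7.9},
        {"datatype": "int8", "layout": "NHWC",
         "per_layer_ms": {"conv1": 0.22, "fc1000": 0.05}, "pipeline_ms": 4.3},
    ],
}
# Tag the winner: the config with the lowest whole-pipeline runtime.
results["winner"] = min(results["configs"], key=lambda c: c["pipeline_ms"])

with open("tuning.json", "w") as f:
    json.dump(results, f, indent=2)

# On a later run, the file can be read back to set the knobs directly,
# skipping the benchmarking pass entirely.
with open("tuning.json") as f:
    winner = json.load(f)["winner"]
print(winner["datatype"], winner["layout"])  # prints "int8 NHWC"
```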
causten commented 9 months ago

Initial Autotuning work complete in ROCm 5.7