ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

MLIR Solvers Tuning Proposal #1018

Open jerryyin opened 3 years ago

jerryyin commented 3 years ago

Background and Motivation

Tuning has long been known to provide optimal performance for a variety of algorithms. According to the study in https://github.com/ROCmSoftwarePlatform/MIOpen/issues/939, this continues to be true for the cpp igemm solvers, the ones that the MLIR solvers are prototyped from. To achieve optimal performance on important configs, we want to implement tuning for the MLIR backend solvers in MIOpen. This involves three different areas:

Use Cases

The implementation of tuning supports two high-level use cases:

Implementation

Tuning parameter passing mechanism

Tuning parameters will be passed as comma-separated C strings, in exactly the same format as the field stored in perfdb. The string can be produced and consumed via Serialize()/Deserialize(), as in MIOpen::Serializable{}. This way, either MIOpen or MLIR can communicate the tuning parameters with one additional string field. The benefits of using this mechanism are:

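To illustrate the Serialize()/Deserialize() round-trip described above, here is a minimal self-contained sketch. The struct and field names are hypothetical stand-ins, not the actual MLIR tuning fields or the real MIOpen::Serializable machinery:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical tuning-parameter struct; field names are illustrative only.
struct MlirTuningParams
{
    int GemmMPerBlock = 0;
    int GemmNPerBlock = 0;
    int GemmKPerBlock = 0;

    // Serialize into the comma-separated form stored in perfdb.
    std::string Serialize() const
    {
        std::ostringstream ss;
        ss << GemmMPerBlock << ',' << GemmNPerBlock << ',' << GemmKPerBlock;
        return ss.str();
    }

    // Deserialize from the same comma-separated form; returns false on malformed input.
    bool Deserialize(const std::string& s)
    {
        std::istringstream ss(s);
        std::string tok;
        std::vector<int> vals;
        while(std::getline(ss, tok, ','))
            vals.push_back(std::stoi(tok));
        if(vals.size() != 3)
            return false;
        GemmMPerBlock = vals[0];
        GemmNPerBlock = vals[1];
        GemmKPerBlock = vals[2];
        return true;
    }
};
```

Because both sides agree on this one string format, the parameters can cross the MIOpen/MLIR boundary without either side exposing its internal tuning structures.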
Performance Config

To enable tuning, the MLIR solvers need to implement their own performance config data structures. An MLIR solver needs to implement 5 performance config functions, according to

https://github.com/ROCmSoftwarePlatform/MIOpen/blob/120289fcb33496db05518b24e3a39db01d5adb5c/src/include/miopen/generic_search.hpp#L61-L73

Alternative 1: Implement performance config in MIOpen

All of the performance config implementation can be done together with the MIOpen solvers. This follows existing practice. The implication is that MIOpen will be MLIR's only tuning driver, since only MIOpen's performance config has knowledge of how to tune an MLIR solver. This looks acceptable, as we do intend to use only the MIOpen/Tuna infrastructure for tuning MLIR solvers. In any other scenario, the MLIR infrastructure works strictly as a compiler/kernel generator.

In this case, all we need to do is implement two performance config classes:

Alternative 2: Implement performance config in MLIR

With alternative 1, one might worry that the two components are coupled too tightly, because MIOpen's performance config has to contain implementation details of an MLIR solver. When an MLIR developer decides to change the way a kernel is tuned, they have to make the change in MIOpen.

A solution to this is to abstract the implementation details behind the Miir.h interface, leaving the MLIR component solely responsible for its tuning. This can be challenging because a performance config does not depend on any external information to make its decisions. Performance configs are just data structures that encode the tuning parameters. In particular, SetNextValue() does not carry any context information at all. This makes it very hard to create a handle and give MLIR the information it would need to produce the next value based on the current performance config.

We need a combination of changes to make it possible to place the performance config tuning logic in MLIR:

1. One generic performance config

In this generic implementation, all MLIR solvers can share one implementation of the performance config. Under the hood, the performance config will invoke the Miir interface to decide whether the config is valid and how to set the next config.

struct PerformanceMlirGeneric
{
    std::string TuningParams;
    PerformanceMlirGeneric(const std::string& params) : TuningParams(params) {}
    PerformanceMlirGeneric(bool spare) : TuningParams("uninitialized_spare") {}
    PerformanceMlirGeneric() : TuningParams("uninitialized") {}
    bool operator==(const PerformanceMlirGeneric& other) const
    {
        return TuningParams == other.TuningParams;
    }
    bool IsValid(const ConvolutionContext& ctx) const
    {
        return MiirIsConfigApplicable(mlir::ConstructBuildOptions(ctx, TuningParams, ...));
    }
    bool SetNextValue(const ConvolutionContext& ctx)
    {
        std::string nextTuningParams;
        bool isEndOfSequence;
        MiirGetNextTuningParams(mlir::ConstructBuildOptions(ctx, TuningParams, ...), nextTuningParams, isEndOfSequence);
        if(isEndOfSequence)
            return false;
        TuningParams = nextTuningParams;
        return true;
    }
};
2. ComputedIteratorMlirGeneric

The new ComputedIteratorMlirGeneric will inherit from the original iterator. Based on whether the performance config is a PerformanceMlirGeneric, we have a different implementation of SetNextValue() that allows us to pass in the convolution context.

do
{
    if(!v.SetNextValue(*p))
    { // Wraparound, end reached. Iterator is useless from now.
        p = nullptr;
        break;
    }
} while(!v.IsValid(*p));
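A minimal self-contained sketch of such an iterator is below. Apart from SetNextValue() and IsValid(), which come from the discussion above, all names here are assumptions for illustration; ToyConfig and ToyContext stand in for PerformanceMlirGeneric and ConvolutionContext so the advance loop can be exercised on its own:

```cpp
#include <cassert>

// Sketch only: a generic iterator that advances a performance config using
// the context. The context pointer doubles as the end-of-sequence marker.
template <typename PerfConfig, typename Context>
class ComputedIteratorMlirGeneric
{
    PerfConfig v;
    const Context* p; // nullptr once the sequence is exhausted

public:
    explicit ComputedIteratorMlirGeneric(const Context& ctx) : p(&ctx) {}

    ComputedIteratorMlirGeneric& operator++()
    {
        do
        {
            if(!v.SetNextValue(*p))
            { // Wraparound, end reached. Iterator is useless from now.
                p = nullptr;
                break;
            }
        } while(!v.IsValid(*p));
        return *this;
    }

    bool AtEnd() const { return p == nullptr; }
    const PerfConfig& operator*() const { return v; }
};

// Toy stand-ins so the sketch is testable without MIOpen.
struct ToyContext { int limit; };
struct ToyConfig
{
    int value = 0;
    bool SetNextValue(const ToyContext& ctx)
    {
        if(value >= ctx.limit)
            return false;
        ++value;
        return true;
    }
    // Pretend only even values are applicable configs.
    bool IsValid(const ToyContext&) const { return value % 2 == 0; }
};
```

The key point the sketch captures is that the advance step now has the context in hand, so the config's SetNextValue() can consult it.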
3. Dynamically create an MLIR container
const auto& perfCfg = s.GetPerformanceConfig(context);
if(dynamic_cast<const PerformanceMlirGeneric*>(&perfCfg) != nullptr)
{
    ComputedContainerMlirGeneric<PerformanceMlirGeneric, Context> main(context);
    ...
}
Side effects

MLIR Implementations

For alternative 1, the only changes needed on MLIR side are:

For alternative 2, in addition, a couple of new APIs are required to implement the performance config interface:

extern "C" MiirStatus miirTuningParamsInit(MiirHandle handle, char *tuningParams, size_t *size);
extern "C" MiirStatus miirTuningParamsNext(MiirHandle handle, char *tuningParams, size_t *size, bool *isEndOfSequence);

Under the hood, the two interface functions on the MLIR side will invoke the corresponding static functions. Depending on whether it is an xdlops request, as determined from the handle, they will compute and return a new set of tuning parameters based on the one stored in the handle.
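To illustrate how MIOpen's generic search could drive these calls, below is a toy sketch. The miirTuningParamsNext stub stands in for the real MLIR-side implementation (the toy handle, the iteration state, and the buffer-size convention are all assumptions, not the settled API); EnumerateTuningParams shows the calling pattern on the MIOpen side:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-ins for the proposed C interface. The handle is modeled as a
// list of precomputed parameter strings; the real handle would drive the
// MLIR-side walk over the tuning space.
typedef int MiirStatus;
const MiirStatus MIIR_SUCCESS = 0;
typedef std::vector<std::string>* MiirHandle;

static size_t g_pos = 0; // toy iteration state; real state would live in the handle

MiirStatus miirTuningParamsNext(MiirHandle handle,
                                char* tuningParams,
                                size_t* size,
                                bool* isEndOfSequence)
{
    if(g_pos >= handle->size())
    {
        *isEndOfSequence = true;
        return MIIR_SUCCESS;
    }
    const std::string& s = (*handle)[g_pos++];
    const size_t n       = std::min(*size, s.size());
    s.copy(tuningParams, n); // copy at most the caller's buffer size
    *size            = n;    // report how many bytes were written
    *isEndOfSequence = false;
    return MIIR_SUCCESS;
}

// The MIOpen-side loop: keep asking for the next comma-separated parameter
// string until the sequence is exhausted.
std::vector<std::string> EnumerateTuningParams(MiirHandle handle)
{
    std::vector<std::string> out;
    for(;;)
    {
        char buf[256];
        size_t size = sizeof(buf);
        bool end    = false;
        if(miirTuningParamsNext(handle, buf, &size, &end) != MIIR_SUCCESS || end)
            break;
        out.emplace_back(buf, size);
    }
    return out;
}
```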

Summary

Considering we do plan on making MIOpen the only tuning driver for MLIR, I personally favor alternative 1 over alternative 2 for a variety of reasons:

atamazov commented 3 years ago

Especially in SetNextValue(), it doesn't carry any context information at all. This makes it very hard to create a handle and let MLIR know any information that help to create the nextValue based on the current set of performance config.

Please see #388 and raise urgency level if necessary.

Considering we do plan on make MIOpen the only tuning driver for MLIR, I personally favor alternative 1 over 2 on a variety of factors...

I am for it.

jerryyin commented 3 years ago

Please see #388 and raise urgency level if necessary.

Thanks for bringing that up; I didn't know that had been documented. This is going to be a high-priority job for me because it's going to be the largest feature in the next milestone.

I am for it.

Niiiiiice. Thanks for the review.

@JehandadKhan @whchung Any questions, concerns?

JehandadKhan commented 3 years ago

@jerryyin We have consensus since Approach 1 makes the most sense to me as well. Thank you for formalizing the ideas.

carlushuang commented 3 years ago

@jerryyin It is a good idea to pass ctx into SetNextValue. We have another scenario that can utilize this new API: we can more easily decide the gemm_k split value based on the convolution context passed in. Currently we iterate this value based on either the next power of 2 or the next multiple of the CU number, without knowledge of the current conv config.

Indeed, I think there should be no harm in just adding ctx to SetNextValue's arguments. If a solver does not need the ctx value, it can simply ignore it.
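For concreteness, the two iteration strategies mentioned above can be sketched as below; the function names and the explicit CU-count parameter are hypothetical, and a ctx-aware SetNextValue could choose between them using the convolution context:

```cpp
#include <cassert>

// Next power of 2 strictly greater than cur.
int NextPowerOfTwo(int cur)
{
    int v = 1;
    while(v <= cur)
        v *= 2;
    return v;
}

// Next multiple of the CU count strictly greater than cur.
int NextMultipleOfCU(int cur, int numCUs)
{
    return (cur / numCUs + 1) * numCUs;
}
```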

cc @shaojiewang

jerryyin commented 3 years ago

@JehandadKhan Thank you, appreciate your time for the review.

@carlushuang Agreed, I'm sure knowing more context information can be beneficial. Now that I have a majority vote for alternative 1, we should continue the SetNextValue(ctx) refactor in its original ticket.


As the next step, I'd do the following:

jerryyin commented 3 years ago

Seeing no further feedback, I'm starting the implementation of tuning based on approach 1. I will start with the non-xdlops MLIR solvers.

ppanchad-amd commented 4 months ago

@jerryyin, Is the implementation complete? If so, can you please close this ticket? Thanks!

atamazov commented 3 months ago

@ppanchad-amd This ticket is intended to document the current design (and its motivation), so it should never be closed. I recommend removing the https://github.com/ROCm/MIOpen/labels/urgency_high and https://github.com/ROCm/MIOpen/labels/value_high labels.

JehandadKhan commented 3 months ago

Perhaps we can move it to the wiki or documentation ?


ppanchad-amd commented 3 months ago

@JehandadKhan Created internal ticket to assist with the documentation. Thanks!