apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

More fine-grained operator implementation dispatch & memory planning flow #13598

Open eric-haibin-lin opened 5 years ago

eric-haibin-lin commented 5 years ago

Existing Execution Flow

g = graph()
shapes = g.infer_shape()
types = g.infer_type()
storage_types, dispatch_modes = g.infer_storage_type()
memory_plan = nnvm::plan_memory() // which calls node.FInplaceOption(node.attrs)
for node in g:
  fcompute = get_fcompute(node)
  fcompute(x)
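
For reference, the get_fcompute(node) step above is essentially a per-device registry lookup; below is a minimal sketch of that lookup, assuming MXNet's "FCompute<cpu>"/"FCompute<gpu>" attribute naming and omitting error handling.

#include <mxnet/op_attr_types.h>
#include <nnvm/op.h>

// Rough sketch only: today the executor resolves one FCompute kernel per
// (operator, device type) pair from the nnvm attribute registry, so the
// backend choice (CUDNN vs. fallback, MKL-DNN vs. plain CPU) has to happen
// inside that single kernel at run time.
mxnet::FCompute GetFComputeSketch(const nnvm::Op* op, int dev_mask) {
  static const auto& fcompute_cpu =
      nnvm::Op::GetAttr<mxnet::FCompute>("FCompute<cpu>");
  static const auto& fcompute_gpu =
      nnvm::Op::GetAttr<mxnet::FCompute>("FCompute<gpu>");
  if (dev_mask == mshadow::gpu::kDevMask) {
    return fcompute_gpu.get(op, nullptr);
  }
  return fcompute_cpu.get(op, nullptr);
}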

Drawbacks of the existing flow

The concrete kernel (e.g. CUDNN vs. the generic GPU implementation, or MKL-DNN vs. the plain CPU one) is only chosen inside FCompute at execution time, after memory planning has already run. FInplaceOption and the memory planner therefore cannot take into account which implementation will actually execute, even though different implementations have different in-place and memory requirements.

Alternative Flow

Op implementations

void ConvolutionComputeCUDNN(const nnvm::NodeAttrs& attrs,
                             const OpContext& ctx,
                             const std::vector<TBlob>& inputs,
                             const std::vector<OpReqType>& req,
                             const std::vector<TBlob>& outputs) {
  // 1st CUDNN implementation goes here
}

void ConvolutionComputeMKL(const nnvm::NodeAttrs& attrs,
                           const OpContext& ctx,
                           const std::vector<NDArray>& inputs,
                           const std::vector<OpReqType>& req,
                           const std::vector<NDArray>& outputs) {
  // MKL-DNN implementation goes here (FComputeEx signature, NDArray-based)
}

void ConvolutionComputeGPU(const nnvm::NodeAttrs& attrs,
                           const OpContext& ctx,
                           const std::vector<TBlob>& inputs,
                           const std::vector<OpReqType>& req,
                           const std::vector<TBlob>& outputs) {
  // generic GPU implementation goes here
}

void ConvolutionComputeCPU(const nnvm::NodeAttrs& attrs,
                           const OpContext& ctx,
                           const std::vector<TBlob>& inputs,
                           const std::vector<OpReqType>& req,
                           const std::vector<TBlob>& outputs) {
  // generic CPU implementation goes here
}
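
The attrs.exec_func used below is not an existing NodeAttrs field; here is a minimal sketch of the kind of extension the proposal assumes (names and layout are illustrative only).

#include <vector>

#include <mxnet/op_attr_types.h>
#include <nnvm/node.h>

// Hypothetical extension of the node attributes: the kernel chosen during
// storage-type inference is remembered so that memory planning and execution
// can both see it. A raw function pointer keeps the comparison inside
// FInplaceOption below straightforward; an FComputeEx-style (NDArray-based)
// kernel such as the MKL-DNN one would need a second field or a variant.
using ExecFunc = void (*)(const nnvm::NodeAttrs&,
                          const mxnet::OpContext&,
                          const std::vector<mxnet::TBlob>&,
                          const std::vector<mxnet::OpReqType>&,
                          const std::vector<mxnet::TBlob>&);

struct NodeAttrsEx : public nnvm::NodeAttrs {
  ExecFunc exec_func = nullptr;  // nullptr: fall back to the registered FCompute
};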

New FInferStorageTypeEx interface

void FInferStorageTypeEx(const std::vector<TShape>& in_shapes,
                         const std::vector<int>& in_types,
                         const std::vector<int>& in_stype,
                         const std::vector<TShape>& out_shape,
                         const std::vector<int>& out_type,
                         std::vector<int>* out_stype,
                         int dev_mask,
                         NodeAttrs* attrs,            // mutable
                         DispatchMode* dispatch_mode  // mutable
                         ) {
  // GPU
  if (dev_mask == kGPU) {
    (*out_stype)[0] = kDefaultStorage;
    *dispatch_mode = DispatchMode::kFCompute;
#if MXNET_USE_CUDNN
    if (attrs->params.kernel.ndim() == 2 && in_types[0] == mshadow::kFloat32 && in_shapes[0].ndim() == 1 && …) {
      attrs->exec_func = ConvolutionComputeCUDNN;
    } else {
      attrs->exec_func = ConvolutionComputeGPU;
    }
#else
    attrs->exec_func = ConvolutionComputeGPU;
#endif
  // CPU
  } else {
#if MXNET_USE_MKLDNN
    attrs->exec_func = ConvolutionComputeMKL;
    (*out_stype)[0] = kDefaultStorage;
    *dispatch_mode = DispatchMode::kFComputeEx;
#else
    attrs->exec_func = ConvolutionComputeCPU;
    ...
#endif
  }
}
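
How such a function could be attached to the operator, assuming a new FInferStorageTypeEx attribute type registered the same way FInferStorageType is today (the function name ConvolutionInferStorageTypeEx is illustrative):

// Hypothetical registration, mirroring the existing FInferStorageType hook.
// ConvolutionInferStorageTypeEx would be a convolution-specific function with
// the signature sketched above.
NNVM_REGISTER_OP(Convolution)
.set_attr<FInferStorageTypeEx>("FInferStorageTypeEx", ConvolutionInferStorageTypeEx);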

FInplaceOption for convolution

[](const NodeAttrs& attrs) {
  if (attrs.exec_func == ConvolutionComputeCUDNN) {
    return std::vector<std::pair<int, int>>{{0, 0}};  // output 0 can share memory with input 0
  } else {
    return std::vector<std::pair<int, int>>{};
  }
}

New Execution Flow:

g = graph()
shapes = g.infer_shape()
types = g.infer_type()
if g.has_attr('FInferStorageTypeEx'):
  storage_types, dispatch_modes = g.infer_storage_type_ex()
else:
  storage_types, dispatch_modes = g.infer_storage_type()
memory_plan = nnvm::plan_memory() // which calls node.FInplaceOption(node.attrs)
for node in g:
  if node.attrs.exec_func:
    fcompute = node.attrs.exec_func
  else:
    fcompute = get_fcompute(node)
  fcompute(x)
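
In C++ terms, the per-node dispatch at the end of this flow could look roughly like the sketch below (reusing the hypothetical NodeAttrsEx and GetFComputeSketch helpers from earlier; this is not actual executor code).

#include <vector>

// Run the kernel selected during FInferStorageTypeEx if one was recorded,
// otherwise fall back to the usual per-device FCompute lookup.
void RunNodeSketch(const NodeAttrsEx& attrs, const nnvm::Op* op, int dev_mask,
                   const mxnet::OpContext& ctx,
                   const std::vector<mxnet::TBlob>& inputs,
                   const std::vector<mxnet::OpReqType>& req,
                   const std::vector<mxnet::TBlob>& outputs) {
  if (attrs.exec_func != nullptr) {
    attrs.exec_func(attrs, ctx, inputs, req, outputs);
  } else {
    GetFComputeSketch(op, dev_mask)(attrs, ctx, inputs, req, outputs);
  }
}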

@DickJC123 @ptrendx @piiswrong @reminisce

pengzhao-intel commented 5 years ago

Great, the new proposal will resolve several issues we encountered during development of the MKL-DNN backend. We will have some internal discussions first and get back to you with more details 👍 @TaoLv @ZhennanQin

marcoabreu commented 5 years ago

Hey Haibin,

I really like the direction your proposal is going! Instead of using preprocessor statements, would it be possible to make use of class hierarchies and abstraction to achieve the same goal? Especially the GPU preprocessor flags and storage types strike me as a problem here - imagine we start supporting another accelerator, like AMD GPUs, or other backends that also need a special case.

In general, I think it would be great if we could abstract the task (e.g. InferStorageType or ConvolutionCompute) from the backend implementation (ConvolutionComputeCUDNN, ConvolutionComputeMKLDNN, etc.). That way, we could extend MXNet with as many accelerator backends as we want without having to modify the core logic. At the moment, there's a strong coupling which causes maintenance issues.

@DickJC123 already started something towards that direction. Maybe you can provide a bit of input here.
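
To make the abstraction idea a bit more concrete, here is a minimal sketch of what such a backend interface could look like; everything below is hypothetical and not existing MXNet API.

#include <memory>
#include <string>
#include <vector>

#include <mxnet/op_attr_types.h>
#include <nnvm/node.h>

// Hypothetical per-accelerator backend: each backend reports whether it can
// handle a given node and, if so, which kernel to run. Dispatch code then no
// longer names concrete backends or uses #if MXNET_USE_* branches.
class OperatorBackend {
 public:
  virtual ~OperatorBackend() = default;
  virtual std::string Name() const = 0;  // e.g. "CUDNN", "MKLDNN"
  virtual bool Supports(const nnvm::NodeAttrs& attrs,
                        const std::vector<mxnet::TShape>& in_shapes,
                        const std::vector<int>& in_types) const = 0;
  virtual mxnet::FCompute Kernel() const = 0;
};

// Pick the first backend (in priority order) that supports this node;
// nullptr means "use the default registered kernel".
mxnet::FCompute SelectKernel(
    const std::vector<std::unique_ptr<OperatorBackend>>& backends,
    const nnvm::NodeAttrs& attrs,
    const std::vector<mxnet::TShape>& in_shapes,
    const std::vector<int>& in_types) {
  for (const auto& b : backends) {
    if (b->Supports(attrs, in_shapes, in_types)) return b->Kernel();
  }
  return nullptr;
}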

zhaoyao73 commented 5 years ago

Agreed. It is a good idea, but it looks like a hack.

eric-haibin-lin commented 5 years ago

@marcoabreu thanks for the comments. It's true that the existing infer_storage interface and the proposed infer_storage_ex interface both need backend-specific logic. What kind of abstraction would you like to see? Let's say each backend provides one implementation which only concerns that backend itself. Now how does MXNet provide a general guide to select/prioritize these implementations if it is built with MKLDNN+CUDA+AMDHIP? What order would you propose to invoke these functions in, and what if one of them conflicts with other backends? How does MXNet resolve these conflicts on a per-operator basis? I do want to limit the discussion to memory planning itself so that @DickJC123's work on NHWC can be unblocked as soon as possible.

marcoabreu commented 5 years ago

Thanks for your very good questions!

For the operator selection I would think about a design which has something similar to a "tuning" or warm-up stage that evaluates the different possibilities. Initially, since that revamp would be quite big and experimental, I would hardcode an order (e.g. CUDA->AMD->MKLDNN->CPU) which is then evaluated, with certain backends dropped if they don't support that operator or are simply not present. Later on, there would ideally be a benchmark step which evaluates the different possibilities and then chooses the most efficient representation of the graph. This evaluation would first start with simple benchmarks (with different strategies like memory footprint, power consumption, throughput, etc.) of each operator backend, and then in the next stage go one level higher and evaluate groups of operators (up to evaluating the entire graph) to accommodate layout-conversion and memcopy overhead. In the last iteration, we would have the most efficient graph that is also runnable on that hardware.
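
A tiny sketch of the warm-up idea, timing each eligible candidate on representative inputs and keeping the fastest (purely illustrative; the criterion could equally be memory footprint or power):

#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Each candidate is an eligible backend kernel already bound to representative
// inputs; the warm-up stage runs them (in practice: several warm-up plus timed
// iterations) and remembers the fastest.
struct Candidate {
  std::string name;
  std::function<void()> run;
};

std::size_t PickFastest(const std::vector<Candidate>& candidates) {
  std::size_t best = 0;
  double best_us = std::numeric_limits<double>::max();
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    auto t0 = std::chrono::steady_clock::now();
    candidates[i].run();
    auto t1 = std::chrono::steady_clock::now();
    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    if (us < best_us) { best_us = us; best = i; }
  }
  return best;
}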

There are two ways I could think of backends conflicting:

  1. Mismatching memory layouts
  2. Impossible/unlikely combinations (CUDA & AMDHIP, or MKL & ARM)

To solve number one, I would extend the design to not only have the operators abstracted, but also their memory layouts. In the same way as we would have an operator registry, we would have a memory layout registry where each backend announces its memory layouts (this could be rearranging data or moving it to different memory slots like GPU memory) as well as converters. Each operator implementation would specify a desired layout (most likely the one it registered itself). Now imagine you have a graph with three operators:

Input -> Operator1_CUDA -> Operator2_MKL -> Operator3_MKL -> Output

These three operators are from two entirely different backends and have their own implementations and memory layouts. Our engine would, during the initial analysis of the graph (this step is after the optional graph optimization, and we assume the graph is final at that point), analyse the desired layout of each operator (in this case CUDA and MKL, but it could also go a level deeper, like CUDA_NHWC etc.) and then see whether they are compatible. If they are not, the engine would request a converter from the memory layout registry. These converters would then be inserted into the graph, and the final graph would look as follows:

Input -> Convert_Standard_CUDA -> Operator1_CUDA -> Convert_CUDA_MKL -> Operator2_MKL -> Operator3_MKL -> Convert_MKL_Standard -> Output

This way, you always have compatibility between the different layouts, while neither the operators nor the engine has to care about the different backends, as that conversion happens in between. When an operator receives and outputs data, it operates in its own "isolated" world. If the operators are from the same backend and use the same layout though, this conversion is skipped and a performance advantage is achieved. Now at this point you could need a converter between every single pair of memory layouts, which scales quadratically. The trick here is to have a standard layout (which we basically already have and which is used to input and output data from the graphs). Each memory layout has to register at least two converters: TO_STANDARD and FROM_STANDARD. This keeps compatibility for backend pairs where no direct conversion exists. Since this will require two conversions (FROM_MEMLAYOUT1_TO_STANDARD and FROM_STANDARD_TO_MEMLAYOUT2), it will have additional overhead but keep compatibility high. For common cases, there would probably be direct converters.
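
A compact sketch of that layout registry with the TO_STANDARD/FROM_STANDARD fallback (types and the "STANDARD" name are placeholders; a registered direct converter short-circuits the two-hop path):

#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical layout registry: backends register converters between layouts;
// when no direct converter exists, conversion goes through the standard layout
// in two hops.
using Layout = std::string;  // e.g. "STANDARD", "CUDA_NHWC", "MKLDNN_BLOCKED"
using Converter = std::function<void(/* src tensor, dst tensor */)>;

class LayoutRegistry {
 public:
  void Register(const Layout& from, const Layout& to, Converter c) {
    direct_[{from, to}] = std::move(c);
  }
  // Conversion chain the engine would insert between two adjacent operators.
  std::vector<Converter> Path(const Layout& from, const Layout& to) const {
    if (from == to) return {};
    auto it = direct_.find({from, to});
    if (it != direct_.end()) return {it->second};   // direct converter
    return {direct_.at({from, "STANDARD"}),         // FROM_X_TO_STANDARD
            direct_.at({"STANDARD", to})};          // FROM_STANDARD_TO_Y
  }

 private:
  std::map<std::pair<Layout, Layout>, Converter> direct_;
};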

For the second case where conflicting backends exist, they would simply be skipped during the evaluation stage when the engine checks whether an operator is actually eligible. So if CUDA is not present, the operator will simply not be considered for that graph.

The optimization I talked about in the first paragraph could be developed in three stages:

  1. Simple hardcoded priority (maybe with ENV var)
  2. Pre-run benchmark that returns the optimal graph. This will increase the startup duration.
  3. Background optimization: Immediately serve requests, but slightly modify the graph every now and then in order to approach the optimal graph. This will slightly increase the initial latency (due to the initial suboptimal operator choice) but will result in the most efficient graph in the end as well.

This optimization could either be done as part of the run or also run separately (imagine a CD pipeline) and then deployed together with the model to avoid having to re-do this optimization all the time.
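
Stage 1 could be as simple as a hard-coded order that an environment variable can override; the variable name below (MXNET_BACKEND_PRIORITY) is made up for this sketch.

#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Illustrative stage-1 priority list: default order, overridable via a
// hypothetical comma-separated environment variable.
std::vector<std::string> BackendPriority() {
  std::vector<std::string> order = {"CUDA", "AMD", "MKLDNN", "CPU"};
  if (const char* env = std::getenv("MXNET_BACKEND_PRIORITY")) {
    order.clear();
    std::stringstream ss(env);
    std::string item;
    while (std::getline(ss, item, ',')) order.push_back(item);
  }
  return order;
}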

I hope my prose makes my idea a bit clearer. If you would like, I'm happy to draw a small diagram.

marcoabreu commented 5 years ago

Just had a small chat with Haibin. So just to clarify, my idea would be rather long-term, to avoid having all the preprocessor directives in FInferStorageTypeEx and similar places.

Within the above design, FInferStorageTypeEx would be part of the abstract operator interface which each backend would implement. The memory layout manager would then invoke that function in the same fashion as described by Haibin but the implementation would be in different places.

I totally see that my proposal is way out of scope for the current problem and agree that Haibin's method is definitely the best way to go considering the constraints of the current design around preprocessor directives.

Just wanted to write down my idea for future cases when the operator implementation design might get revisited.

pengzhao-intel commented 5 years ago

Regarding memory planning, there are different requirements from GPU and CPU. It looks like the GPU wants to support both NCHW and NHWC, but the CPU with MKL-DNN prefers padded memory (the GPU still needs this in some cases).

@eric-haibin-lin do you mind elaborating on the possible changes to "memory planning"? How would the new change cover the diverse memory requirements of the different backends?

We're open to the changes and happy to work together :)

ZhennanQin commented 5 years ago

@eric-haibin-lin Thanks for proposing this. We like this change as it can benefit the MKLDNN subgraph and memory planning as well. Just one small comment on attrs.exec_func = ConvolutionComputeCUDNN;: can we have different labels for different backends to make the backend information more generic? Then we can read this label and make different decisions based on it.
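
One way to read this: store a generic backend label next to (or instead of) the raw function pointer, so later passes can branch on the backend without comparing pointers. A small sketch with made-up names:

#include <mxnet/op_attr_types.h>

// Hypothetical label recorded by FInferStorageTypeEx alongside the selected
// kernel; subgraph and memory-planning passes can then ask "which backend was
// chosen?" generically.
enum class KernelBackend { kDefaultCPU, kDefaultGPU, kCUDNN, kMKLDNN };

struct SelectedImpl {
  KernelBackend backend = KernelBackend::kDefaultCPU;
  mxnet::FCompute fn;  // or an FComputeEx for NDArray-based kernels
};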

DickJC123 commented 5 years ago

I looked this over and concluded it's a complicated issue, so I'm not ready to take a strong stand. Some thoughts though:

One desire of mine would be to have an op implementation be a slight tweak of the existing "base op" without copying the entire op description. Also, whatever is decided on here should play well with the existing op registration mechanism and the subgraph API.

I haven't studied subgraph frankly, but I'd hope one could make implementation decisions (and graph alterations) based on all the information normally provided to the forward/backward op calls (so context, shapes, dtypes, stypes, etc.). So during subgraph graph passes, would there be a lightweight way to swap in a tweaked operator?

So, for discussion, what about:

NNVM_REGISTER_OP(Convolution_CUDNN_Impl)
.clone_of(Convolution)                          // clones the Convolution op under a new name
.override_attr<FInplaceOption>("FInplaceOption",
    [](const NodeAttrs& attrs) { return std::vector<std::pair<int, int>>{{0, 0}}; });

Then use the subgraph API to swap in a Convolution_CUDNN_Impl node for a Convolution node if the parameters, gpu arch, etc. supported it? Alternatively, in a subgraph API graph pass, could one keep the node, but attach an attribute that preempts the default FInplaceOption (or overwrites the FInplaceOption function pointer used by the node directly)?
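
A rough sketch of that node swap as a graph pass, assuming Convolution_CUDNN_Impl has been registered as above; IsCUDNNEligible is a placeholder for the parameter/arch checks:

#include <nnvm/graph.h>
#include <nnvm/node.h>
#include <nnvm/op.h>

// Placeholder predicate: kernel dims, dtype, GPU arch, etc.
bool IsCUDNNEligible(const nnvm::NodeAttrs& attrs);

// Illustrative pass: keep each node but swap its op for the cloned CUDNN
// variant when eligible, so the overridden FInplaceOption takes effect.
nnvm::Graph SwapInCUDNNConv(nnvm::Graph g) {
  const nnvm::Op* conv = nnvm::Op::Get("Convolution");
  const nnvm::Op* conv_cudnn = nnvm::Op::Get("Convolution_CUDNN_Impl");
  nnvm::DFSVisit(g.outputs, [&](const auto& node) {
    if (node->op() == conv && IsCUDNNEligible(node->attrs)) {
      node->attrs.op = conv_cudnn;  // parameters and name are kept
    }
  });
  return g;
}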

DickJC123 commented 5 years ago

Again, one goal of this is to not tie all operator implementations together. It's hard for one committer to have the knowledge and energy to modify and create tests for all implementations. Now, trying for the simplest solution, suppose we just segment the CPU implementations from the GPU ones? Right now, this is already done with FCompute. The corresponding new code might be:

.set_attr<FInplaceOption>("FInplaceOption<gpu>", [] (const NodeAttrs& attrs) { return {0,0}; })
.set_attr<FInplaceOption>("FInplaceOption<cpu>", [] (const NodeAttrs& attrs) { return {}; })

Alternatively, since AssignContext() is run before g = nnvm::ApplyPass(g, "PlanMemory"), we could additionally divide up and attach the dev_mask information that is currently an attribute of the entire graph to each node. Code then would be:

.set_attr<FInplaceOption>("FInplaceOption", [] (const NodeAttrs& attrs) {
    bool is_gpu = attrs.dict["dev_mask"] == mshadow::gpu::kDevMask;
    return is_gpu ? {0,0} : {};
})
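Attaching that per-node dev_mask could itself be a tiny pass run right after AssignContext(); a sketch, with "dev_mask" as an arbitrary dict key and the per-node device vector assumed to come from AssignContext:

#include <cstdint>
#include <string>
#include <vector>

#include <nnvm/graph.h>

// Illustrative helper: copy each node's device mask into its attribute dict so
// per-node lambdas like the one above can read it. The const_cast is only for
// the sketch; a real pass would mutate the graph through proper channels.
void AttachDevMask(nnvm::Graph* g, const std::vector<int>& dev_masks) {
  const auto& idx = g->indexed_graph();
  for (uint32_t nid = 0; nid < idx.num_nodes(); ++nid) {
    auto* node = const_cast<nnvm::Node*>(idx[nid].source);
    node->attrs.dict["dev_mask"] = std::to_string(dev_masks[nid]);
  }
}
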
anirudh2290 commented 5 years ago

This is a good proposal for separating operator implementations for different accelerators. One thing that can be done to lessen the overhead of cross-accelerator logic in FInferStorageEx is the registration of "supports" functions for different accelerators.

Something like the following:

.set_attr<SupportsAccl>("MKLDNN", IsMKLDNNSupported)
.set_attr<SupportAccl>("CUDNN", IsCUDNNSupported)

The default implementation for FInferStorageEx will then be:

void FInferStorageTypeEx(const std::vector<TShape>& in_shapes,
                         const std::vector<int>& in_types,
                         const std::vector<int>& in_stype,
                         const std::vector<TShape>& out_shape,
                         const std::vector<int>& out_type,
                         std::vector<int>* out_stype,
                         int dev_mask,
                         NodeAttrs* attrs,            // mutable
                         DispatchMode* dispatch_mode  // mutable
                         ) {
  // GPU
  if (dev_mask == kGPU) {
    (*out_stype)[0] = kDefaultStorage;
    *dispatch_mode = DispatchMode::kFCompute;
#if MXNET_USE_CUDNN
    if (SupportsCUDNN(...)) {
      attrs->exec_func = ConvolutionComputeCUDNN;
    } else {
      attrs->exec_func = ConvolutionComputeGPU;
    }
#else
    attrs->exec_func = ConvolutionComputeGPU;
#endif
  // CPU
  } else {
#if MXNET_USE_MKLDNN
    if (SupportsMKLDNN(...)) {
      attrs->exec_func = ConvolutionComputeMKL;
      (*out_stype)[0] = kDefaultStorage;
      *dispatch_mode = DispatchMode::kFComputeEx;
    } else {
      attrs->exec_func = ConvolutionComputeCPU;
    }
#else
    attrs->exec_func = ConvolutionComputeCPU;
    ...
#endif
  }
}

This way, in most cases, the devs working on different accelerators will only have to worry about registering FCompute and Supports implementations.
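
A possible shape for the SupportsAccl type and one such predicate, mirroring the arguments FInferStorageTypeEx already receives; the signature is an assumption for illustration.

#include <vector>

#include <mxnet/op_attr_types.h>
#include <nnvm/node.h>

// Hypothetical per-accelerator predicate: given the node attributes and the
// inferred shapes/types, report whether this accelerator can handle the node.
using SupportsAccl = bool (*)(const nnvm::NodeAttrs& attrs,
                              const std::vector<mxnet::TShape>& in_shapes,
                              const std::vector<int>& in_types);

bool IsCUDNNSupported(const nnvm::NodeAttrs& attrs,
                      const std::vector<mxnet::TShape>& in_shapes,
                      const std::vector<int>& in_types) {
  // e.g. 2-D kernels, supported dtype, 4-D NCHW/NHWC input, ...
  return !in_shapes.empty() && in_shapes[0].ndim() == 4;
}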

eric-haibin-lin commented 5 years ago

@reminisce @junrushao1994 maybe we should include this change in the next major refactoring/design

junrushao commented 5 years ago

This will be a great step towards a clearer definition of operators! Nice proposal!

danithaca commented 5 years ago

We observed poor inference latency with MKL-DNN (see https://discuss.mxnet.io/t/poor-inference-latency-with-mkldnn-on-cpu/3339), and maybe the proposal discussed here could help address it. I thought it might be helpful to cross-share the use case here.

pengzhao-intel commented 5 years ago

@danithaca Tao has answered the question in the forum :) thanks for raising the question.

PawelGlomski-Intel commented 3 years ago

@pengzhao-intel In #13749 you mentioned that you were working on this issue. Do you remember where the work ended? Was there any conclusion on this matter?

We have memory allocation problems when oneDNN selects implementations that use blocked formats (which require more memory).