Samsung / ONE

On-device Neural Engine

[onert/train] Attach auxiliary tensors to tensor builder #13282

Open zetwhite opened 3 months ago

zetwhite commented 3 months ago

Background

backend/train/ops/*Layer has extra (auxiliary) tensors used for backward().

For example,

https://github.com/Samsung/ONE/blob/60683ad7293d2a18a7939dc49bcea77d8e09b352/runtime/onert/backend/train/ops/FullyConnectedLayer.h#L61-L65

https://github.com/Samsung/ONE/blob/60683ad7293d2a18a7939dc49bcea77d8e09b352/runtime/onert/backend/train/ops/ConvolutionLayer.h#L56-L60

These tensors are allocated when KernelGenerator visits each operation.

https://github.com/Samsung/ONE/blob/60683ad7293d2a18a7939dc49bcea77d8e09b352/runtime/onert/backend/train/ops/FullyConnectedLayer.cc#L88-L96

What

These auxiliary tensors always hold memory after being configured. So, adding these tensors to the TensorBuilder so that a memory planner can manage them might be helpful.

zetwhite commented 3 months ago

Why do extra tensors become necessary?

Most of the extra tensors are used for temporary values. For example, in FullyConnectedLayer, extra tensors are used to save $W^T$ and $X^T$:

$\frac{\partial L}{\partial X} = fc(\frac{\partial L}{\partial Z}, W^T)$

$\frac{\partial L}{\partial W} = fc((\frac{\partial L}{\partial Z})^T, X^T)$
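
In the backward pass the forward fc kernel is reused as-is, so $W^T$ and $X^T$ have to be materialized into real buffers before the kernel runs, and those buffers are the extra tensors. A rough sketch of the idea (member and helper names below are only for illustration, not the actual layer code):

// Sketch only: the extra tensors hold the transposed operands that backward() needs.
void FullyConnectedLayer::backward()
{
  // fill the extra tensors with the transposed operands
  transpose(_weights, _transposed_weights);                    // W^T
  transpose(_input, _transposed_input);                        // X^T
  transpose(_back_prop_output, _transposed_back_prop_output);  // (dL/dZ)^T

  // dL/dX = fc(dL/dZ, W^T)
  fullyConnected(_back_prop_output, _transposed_weights, _back_prop_input);
  // dL/dW = fc((dL/dZ)^T, X^T)
  fullyConnected(_transposed_back_prop_output, _transposed_input, _grad_weights);
}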

Note


How


ops

Two functions need to be added to each layer:

// backend/train/ops/FullyConnectedLayer.h
class FullyConnectedLayer : public exec::train::ITrainableFunction,
                            public cpu::ops::FullyConnectedLayer
{
    ... 
  // Return the extra tensor requests that FullyConnectedLayer needs
  static ExtraTensorRequests requestExtraTensors(const IPortableTensor *weights,
                                                 const IPortableTensor *input,
                                                 const IPortableTensor *back_prop_output,
                                                 ir::Activation activation);

  // Assign extra tensors to the tensor pointer member variable 
  void configureExtraTensors(std::vector<ExtraTensor *> extra_tensors);
  ... 
};
struct ExtraTensorRequest
{
  ir::OperandInfo info;
  ir::Layout layout;
  ExtraTensorLifeTime lifetime;
};
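
For example, FullyConnectedLayer::requestExtraTensors() would return one request per temporary buffer. A sketch under the API above (transposed_info() and the lifetime value are placeholders, not the final API):

// Sketch only: one request per temporary buffer (W^T, X^T, ...).
ExtraTensorRequests
FullyConnectedLayer::requestExtraTensors(const IPortableTensor *weights,
                                         const IPortableTensor *input,
                                         const IPortableTensor *back_prop_output,
                                         ir::Activation activation)
{
  ExtraTensorRequests reqs;
  // buffer for W^T : same element count as weights, transposed shape
  reqs.push_back({transposed_info(weights), ir::Layout::NHWC, ExtraTensorLifeTime::BACKWARD});
  // buffer for X^T
  reqs.push_back({transposed_info(input), ir::Layout::NHWC, ExtraTensorLifeTime::BACKWARD});
  ...
  return reqs;
}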

ExtraTensorGenerator

This generator is used inside train::BackendContext::genTrainingTensors(). While visiting each operator in the graph, it collects the extra tensor requests from each layer and registers them to the TensorBuilder.
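
Roughly, the flow would look like this (method names are illustrative):

// Sketch only: per operation, ask the layer what it needs and register the
// requests to the tensor builder with an (op_index, sub_index) key.
void ExtraTensorGenerator::visit(const ir::train::operation::FullyConnected &node)
{
  auto requests = ops::FullyConnectedLayer::requestExtraTensors(/* weights, input, ... */);
  for (uint32_t i = 0; i < requests.size(); ++i)
  {
    ExtraTensorIndex index{_op_index, i}; // _op_index : index of `node` in the graph
    _tensor_builder->registerExtraTensorInfo(index, requests[i]);
  }
}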


TensorBuilder

class ExtraTensorIndex
{
  ir::OperationIndex op_index;
  uint32_t sub_index;
};
class TensorRegistry
{
  ... 
  std::unordered_map<ExtraTensorIndex, std::unique_ptr<ExtraTensor>> _extra; 
};

The TensorRegistry in the TensorBuilder stores (ExtraTensorIndex, ExtraTensor) pairs.
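
Since ExtraTensorIndex is used as an unordered_map key, it also needs equality and a hash. Something like the following would be enough (a sketch, assuming the two index members are accessible and ir::OperationIndex exposes its raw value via value()):

// Sketch: make ExtraTensorIndex usable as an unordered_map key.
inline bool operator==(const ExtraTensorIndex &lhs, const ExtraTensorIndex &rhs)
{
  return lhs.op_index == rhs.op_index && lhs.sub_index == rhs.sub_index;
}

namespace std
{
template <> struct hash<ExtraTensorIndex>
{
  size_t operator()(const ExtraTensorIndex &index) const noexcept
  {
    // combine the operation index and the sub index into one hash value
    return hash<uint32_t>()(index.op_index.value()) ^ (hash<uint32_t>()(index.sub_index) << 1);
  }
};
} // namespace std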


genKernels

While generating kernels, make each layer point to its extra tensors in the TensorRegistry.

void KernelGenerator::visit(const ir::train::operation::FullyConnected &node)
{
    ... 
    auto out_back_prop_tensor = getBackPropOut(out_index);

    fn->configureBackward(in_tensor, weights_tensor, out_tensor, in_back_prop_tensor,
                          weights_grad_tensor, bias_grad_tensor, out_back_prop_tensor, activation,
                          weights_format);

    // also set extra tensors
    auto extra_tensors = getExtraTensors(node);
    fn->configureExtraTensors(extra_tensors);
    ...
}
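
getExtraTensors() above just looks up the registry with (op_index, sub_index) keys. A rough sketch (helper and method names are placeholders):

// Sketch only: collect every extra tensor registered for this node, in sub_index order.
std::vector<ExtraTensor *>
KernelGenerator::getExtraTensors(const ir::train::operation::FullyConnected &node)
{
  std::vector<ExtraTensor *> tensors;
  const auto op_index = ...; // index of `node` in the training graph
  for (uint32_t sub_index = 0;; ++sub_index)
  {
    auto *tensor = _tensor_reg->getExtraTensor(ExtraTensorIndex{op_index, sub_index});
    if (tensor == nullptr)
      break;
    tensors.push_back(tensor);
  }
  return tensors;
}
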
zetwhite commented 2 months ago

After some work on https://github.com/Samsung/ONE/pull/13486, I checked how much memory was reduced compared to the master branch.

mnist

    [      ALLOC     ] allocation capacity: 11360128 # non-const 
    [      ALLOC     ] allocation capacity: 1938880  # trainable 
    [      ALLOC     ] allocation capacity: 11360000 # back-prop 
    [      ALLOC     ] allocation capacity: 1938880  # gradient
    [      ALLOC     ] allocation capacity: 3877760  #  opt variable
    [      ALLOC     ] allocation capacity: 3211264  # disposable 
    [      ALLOC     ] allocation capacity: 6627328  # extra tensors 

mobile net v2

    [      ALLOC     ] allocation capacity: 361362288 # non-const
    [      ALLOC     ] allocation capacity: 13951408  # trainable 
    [      ALLOC     ] allocation capacity: 361362240 # back-prop
    [      ALLOC     ] allocation capacity: 13951408  # gradient 
    [      ALLOC     ] allocation capacity: 27902816  # opt variable 
    [      ALLOC     ] allocation capacity: 49032960  # disposable
    [      ALLOC     ] allocation capacity: 96350208  # extra tensors 

/cc @ragmani

ragmani commented 2 months ago

After applying all PRs related to the draft #13305, the other allocation capacities will be reduced as follows:

mnist

33686912(32.1 MB) -> 25187648(24.0MB)

non-const : 11360128 -> 11341056
trainable : 1938880 -> 1938880
back-prop : 11360000 -> 6423808
gradient : 1938880 -> 1606144
optimizer variables : 3877760 -> 3877760
disposable : 3211264 -> 0

mobile net v2

~827562720(789.2MB) -> 490938032(468.1MB)~ 827562720(789.2MB) -> 508592656(485.0MB)

non-const : 361362288 -> 361362240
trainable : 13951312 -> 13951312
back-prop : 361362240 -> 97241920
gradient : 13951312 -> 5124000
~optimizer variables : 27902608 -> 10248000~
optimizer variables : 27902608 -> 27902624
disposable : 49032960 -> 3010560

The capacity of optimizer variables is 27902624, not 10248000.

zetwhite commented 2 months ago

I'll start to make a PR based on the draft (https://github.com/Samsung/ONE/pull/13486). The draft is somewhat rough; I'll trim it while making the PR.

TODO