Most of the extra tensors are used for temporary values.
For example, in FullyConnectedLayer, extra tensors are used to save $W^T$ and $X^T$:

$\frac{\partial L}{\partial X} = fc(\frac{\partial L}{\partial Z}, W^T)$

$\frac{\partial L}{\partial W} = fc((\frac{\partial L}{\partial Z})^T, X^T)$
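As a rough, standalone sketch (plain C++ with an assumed row-major layout and W stored as out x in; not the actual onert/cker kernels), the first equation amounts to staging $W^T$ in a scratch buffer and running an ordinary matrix multiplication against it:

```cpp
// Sketch only: why backward() needs scratch ("extra") buffers.
// Shapes and layout (row-major, W is out x in) are assumptions for illustration.
#include <cstddef>

// Fill `dst` (in x out) with the transpose of `src` (out x in).
void transpose(const float *src, float *dst, std::size_t out, std::size_t in)
{
  for (std::size_t o = 0; o < out; ++o)
    for (std::size_t i = 0; i < in; ++i)
      dst[i * out + o] = src[o * in + i];
}

// dL/dX = fc(dL/dZ, W^T): the transposed weights live in an extra tensor
// that is only needed while backward() runs.
void fc_backward_input(const float *dZ /* batch x out */, const float *W /* out x in */,
                       float *dX /* batch x in */, float *w_t /* extra tensor, in x out */,
                       std::size_t batch, std::size_t in, std::size_t out)
{
  transpose(W, w_t, out, in);
  for (std::size_t b = 0; b < batch; ++b)
    for (std::size_t i = 0; i < in; ++i)
    {
      float acc = 0.f;
      for (std::size_t o = 0; o < out; ++o)
        acc += dZ[b * out + o] * w_t[i * out + o];
      dX[b * in + i] = acc;
    }
}
```

The point is that `w_t` is a pure temporary: it is written and fully consumed within one `backward()` call, which is exactly the lifetime this proposal wants to hand to a memory planner.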
ExtraTensor is handled at the `*Layer` implementation level, not in TrainableGraph, so everything stays inside the `onert/backend/train` area. For now it is simply an alias: `using ExtraTensor = Tensor;`

ops
Two functions need to be added to each layer:
```cpp
// backend/train/ops/FullyConnectedLayer.h
class FullyConnectedLayer : public exec::train::ITrainableFunction,
                            public cpu::ops::FullyConnectedLayer
{
  ...
  // Return how many extra tensors are necessary in FullyConnected Layer
  static ExtraTensorRequests requestExtraTensors(const IPortableTensor *weights,
                                                 const IPortableTensor *input,
                                                 const IPortableTensor *back_prop_output,
                                                 ir::Activation activation);

  // Assign extra tensors to the tensor pointer member variable
  void configureExtraTensors(std::vector<ExtraTensor *> extra_tensors);
  ...
};
```
```cpp
struct ExtraTensorRequest
{
  ir::OperandInfo info;
  ir::Layout layout;
  ExtraTensorLifeTime lifetime;
};
```
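As a sketch of how a layer might fill these requests (the `ExtraTensorLifeTime::BACKWARD` value, the `transposedInfo()` helper, and the layout choice are assumptions, not the actual code):

```cpp
// Hypothetical sketch of requestExtraTensors() for FullyConnectedLayer.
using ExtraTensorRequests = std::vector<ExtraTensorRequest>;

ExtraTensorRequests
FullyConnectedLayer::requestExtraTensors(const IPortableTensor *weights,
                                         const IPortableTensor *input,
                                         const IPortableTensor *back_prop_output,
                                         ir::Activation activation)
{
  ExtraTensorRequests reqs;

  // W^T and X^T are only needed while backward() runs.
  // transposedInfo() is a hypothetical helper that swaps the last two dims.
  reqs.push_back({transposedInfo(weights->get_info()), ir::Layout::NHWC,
                  ExtraTensorLifeTime::BACKWARD});
  reqs.push_back({transposedInfo(input->get_info()), ir::Layout::NHWC,
                  ExtraTensorLifeTime::BACKWARD});

  // With a fused activation, a buffer for the pre-activation back-prop value
  // may also be requested (assumption).
  if (activation != ir::Activation::NONE)
    reqs.push_back({back_prop_output->get_info(), ir::Layout::NHWC,
                    ExtraTensorLifeTime::BACKWARD});

  return reqs;
}
```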
ExtraTensorGenerator

This generator is used inside `train::BackendContext::genTrainingTensors()`. While visiting each operator in the graph, it collects ExtraTensorRequest info from the corresponding Layer and hands it to the TensorBuilder.
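A rough sketch of one `visit()` in the generator could look like this (the helper functions, the op-index bookkeeping, and the builder API are assumptions):

```cpp
// Hypothetical sketch: collect requests from the layer and register them.
void ExtraTensorGenerator::visit(const ir::train::operation::FullyConnected &node)
{
  // Ask the layer how many extra tensors it needs and with which shapes.
  auto requests = ops::FullyConnectedLayer::requestExtraTensors(
      getWeights(node), getInput(node), getBackPropOutput(node), getActivation(node));

  // Hand each request to the TensorBuilder, keyed by (op_index, sub_index).
  const auto op_index = _node_to_index.at(&node); // assumed bookkeeping
  for (uint32_t i = 0; i < requests.size(); ++i)
    _tensor_builder->registerExtraTensorInfo(ExtraTensorIndex(op_index, i), requests[i]);
}
```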
TensorBuilder
```cpp
class ExtraTensorIndex
{
  ir::OperationIndex op_index;
  uint32_t sub_index;
};

class TensorRegistry
{
  ...
  std::unordered_map<ExtraTensorIndex, std::unique_ptr<ExtraTensor>> _extra;
};
```
The TensorRegistry used by TensorBuilder stores (ExtraTensorIndex, ExtraTensor) pairs.
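Note that `std::unordered_map` needs a hash and equality for `ExtraTensorIndex`; one possible shape (assuming public accessors on the index) is:

```cpp
// Sketch: hash/equality so ExtraTensorIndex can be an unordered_map key.
// The op_index()/sub_index() accessors are assumptions for illustration.
bool operator==(const ExtraTensorIndex &lhs, const ExtraTensorIndex &rhs)
{
  return lhs.op_index() == rhs.op_index() && lhs.sub_index() == rhs.sub_index();
}

namespace std
{
template <> struct hash<ExtraTensorIndex>
{
  size_t operator()(const ExtraTensorIndex &index) const noexcept
  {
    // Combine the operation index with the per-operation sub index.
    return hash<uint32_t>()(index.op_index().value()) ^ (hash<uint32_t>()(index.sub_index()) << 1);
  }
};
} // namespace std
```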
genKernels
While generating kernels, make each layer point to its extra tensors in the TensorRegistry.
```cpp
void KernelGenerator::visit(const ir::train::operation::FullyConnected &node)
{
  ...
  auto out_back_prop_tensor = getBackPropOut(out_index);

  fn->configureBackward(in_tensor, weights_tensor, out_tensor, in_back_prop_tensor,
                        weights_grad_tensor, bias_grad_tensor, out_back_prop_tensor, activation,
                        weights_format);

  // also set extra tensors
  auto extra_tensors = getExtraTensors(node);
  fn->configureExtraTensors(extra_tensors);
  ...
}
```
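`getExtraTensors` above is a hypothetical helper; it could simply collect every registered `(op_index, sub_index)` entry for the node from the TensorRegistry, e.g. (the parameter type, registry API, and `_node_to_index` bookkeeping are assumptions):

```cpp
// Hypothetical helper: gather the node's extra tensors from the registry.
std::vector<ExtraTensor *> KernelGenerator::getExtraTensors(const ir::IOperation &node)
{
  std::vector<ExtraTensor *> tensors;
  const auto op_index = _node_to_index.at(&node); // assumed bookkeeping

  // sub_index runs 0, 1, 2, ... until no more extra tensors are registered.
  for (uint32_t sub = 0;; ++sub)
  {
    auto *tensor = _tensor_reg->getExtraTensor(ExtraTensorIndex(op_index, sub));
    if (tensor == nullptr)
      break;
    tensors.push_back(tensor);
  }
  return tensors;
}
```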
After some work on https://github.com/Samsung/ONE/pull/13486, I checked how much memory was reduced compared to the master branch.
Extra tensor allocation is reduced from 23187968 (22.1 MB) on master to 6627328 (6.3 MB) on the draft (https://github.com/Samsung/ONE/pull/13486).

Allocation log on the draft:

```
[ ALLOC ] allocation capacity: 11360128   # non-const
[ ALLOC ] allocation capacity: 1938880    # trainable
[ ALLOC ] allocation capacity: 11360000   # back-prop
[ ALLOC ] allocation capacity: 1938880    # gradient
[ ALLOC ] allocation capacity: 3877760    # opt variable
[ ALLOC ] allocation capacity: 3211264    # disposable
[ ALLOC ] allocation capacity: 6627328    # extra tensors
```
For the larger model, extra tensor allocation is reduced from 448920128 (428.1 MB) on master to 96350208 (91.8 MB) on the draft (https://github.com/Samsung/ONE/pull/13486).

Allocation log on the draft:

```
[ ALLOC ] allocation capacity: 361362288   # non-const
[ ALLOC ] allocation capacity: 13951408    # trainable
[ ALLOC ] allocation capacity: 361362240   # back-prop
[ ALLOC ] allocation capacity: 13951408    # gradient
[ ALLOC ] allocation capacity: 27902816    # opt variable
[ ALLOC ] allocation capacity: 49032960    # disposable
[ ALLOC ] allocation capacity: 96350208    # extra tensors
```
/cc @ragmani
After applying all PRs related to the draft #13305, the other allocation capacities will be reduced as follows:
33686912 (32.1 MB) -> 25187648 (24.0 MB)

- non-const : 11360128 -> 11341056
- trainable : 1938880 -> 1938880
- back-prop : 11360000 -> 6423808
- gradient : 1938880 -> 1606144
- optimizer variables : 3877760 -> 3877760
- disposable : 3211264 -> 0
~~827562720 (789.2 MB) -> 490938032 (468.1 MB)~~ 827562720 (789.2 MB) -> 508592656 (485.0 MB)

- non-const : 361362288 -> 361362240
- trainable : 13951312 -> 13951312
- back-prop : 361362240 -> 97241920
- gradient : 13951312 -> 5124000
- ~~optimizer variables : 27902608 -> 10248000~~
- optimizer variables : 27902608 -> 27902624
- disposable : 49032960 -> 3010560

(The capacity of optimizer variables is 27902624, not 10248000.)
I'll start to make PRs based on the draft (https://github.com/Samsung/ONE/pull/13486). The draft is somewhat rough; I'll trim it while making the PRs.
- Core
- Backend/train
  - LayerScopeTensorIndex, LayerScopeTensor
  - LayerScopeMemoryManager (see the sketch after this list)
    - notifyFirst(LayerScopeTensorIndex&)
    - notifyLast(LayerScopeTensorIndex&)
    - .. etc
  - genKernels: to register, plan, and allocate extra tensors
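A minimal sketch of what the `notifyFirst` / `notifyLast` pair could drive inside `LayerScopeMemoryManager`, assuming a claim/release-style planner like the existing memory planners (all member and planner names here are assumptions):

```cpp
// Sketch: lifetime-driven planning for layer-scope (extra) tensors.
// Planner interface and member names are assumptions for illustration.
class LayerScopeMemoryManager
{
public:
  // Called when a tensor becomes live (e.g. right before the op's backward()).
  void notifyFirst(const LayerScopeTensorIndex &index)
  {
    _planner->claim(index, _sizes.at(index));
  }

  // Called when a tensor is no longer needed; its region becomes reusable.
  void notifyLast(const LayerScopeTensorIndex &index) { _planner->release(index); }

  // After planning, one buffer of the planned capacity backs all tensors.
  void allocate() { _buffer = std::make_unique<uint8_t[]>(_planner->capacity()); }

private:
  std::unique_ptr<IMemoryPlanner<LayerScopeTensorIndex>> _planner;
  std::unordered_map<LayerScopeTensorIndex, size_t> _sizes;
  std::unique_ptr<uint8_t[]> _buffer;
};
```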
Background

`backend/train/ops/*Layer` has extra (auxiliary) tensors used for `backward()`. For example:

https://github.com/Samsung/ONE/blob/60683ad7293d2a18a7939dc49bcea77d8e09b352/runtime/onert/backend/train/ops/FullyConnectedLayer.h#L61-L65
https://github.com/Samsung/ONE/blob/60683ad7293d2a18a7939dc49bcea77d8e09b352/runtime/onert/backend/train/ops/ConvolutionLayer.h#L56-L60

These tensors are allocated when KernelGenerator visits each operation:

https://github.com/Samsung/ONE/blob/60683ad7293d2a18a7939dc49bcea77d8e09b352/runtime/onert/backend/train/ops/FullyConnectedLayer.cc#L88-L96
What

These auxiliary tensors always hold memory once they are configured. So, adding these tensors into TensorBuilder and letting a memory planner manage them might be helpful.
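To make the expected benefit concrete, here is a tiny standalone arithmetic example (the numbers are made up): if two layers' auxiliary tensors are never live at the same time, a planner can back them with one shared region instead of keeping both allocated for the whole session.

```cpp
// Standalone illustration (not onert code): scratch tensors with disjoint
// lifetimes can share one region instead of each holding memory permanently.
#include <algorithm>
#include <cstddef>
#include <cstdio>

int main()
{
  const std::size_t fc1_scratch = 4 * 1024 * 1024; // live only during fc1's backward()
  const std::size_t fc2_scratch = 6 * 1024 * 1024; // live only during fc2's backward()

  const std::size_t always_alive = fc1_scratch + fc2_scratch;     // current behaviour
  const std::size_t planned = std::max(fc1_scratch, fc2_scratch); // with lifetime planning

  std::printf("always alive: %zu bytes, planned: %zu bytes\n", always_alive, planned);
  return 0;
}
```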