Samsung / ONE

On-device Neural Engine

[onert] CPU backend getTensorShape overhead #4544

Open hseok-oh opened 3 years ago

hseok-oh commented 3 years ago

@wateret (beyond this PR)

Please note that the getTensorShape overhead is significant for models which have a lot of ops and no one or two dominant kernels. It would be good if we could remove the getTensorShape overhead. I think the complex hierarchy hinders the compiler's optimization. Do you have any idea or suggestion?

Originally posted by @glistening in https://github.com/Samsung/ONE/pull/4538#issuecomment-705253011

wateret commented 3 years ago

It does not seem like there is a way of doing it in configure(). So I think reducing the getTensorShape overhead is pretty much all I can come up with. To do so, we may define a raw pointer protocol for shape info.

Let me give a rough example,

Define a raw pointer format: a simple vector where the size (rank) comes first. Say we have a pointer ptr of size 4 * (RANK + 1) bytes.

Then introduce void* ITensor::getRawShape(), and make cker::Shape just a wrapper around this pointer.

Plus, it might be further optimized if the pointer (void *) could be the object itself, using reinterpret_cast.
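
A minimal sketch of that protocol (getRawShape and RawShapeView are hypothetical names, not existing onert API):

```cpp
#include <cstdint>

// Hypothetical raw shape layout: an int32_t buffer where [0] holds the rank and
// [1..rank] hold the dimensions, i.e. 4 * (rank + 1) bytes in total.
struct ITensor
{
  virtual ~ITensor() = default;
  // Hypothetical accessor (not existing onert API) returning the raw shape buffer.
  virtual void *getRawShape() const = 0;
};

// A cker::Shape-like view that merely wraps the raw pointer instead of copying
// the dimensions, so constructing it on every run() costs almost nothing.
class RawShapeView
{
public:
  explicit RawShapeView(const ITensor &tensor)
    : _raw(reinterpret_cast<const int32_t *>(tensor.getRawShape()))
  {
  }
  int32_t DimensionsCount() const { return _raw[0]; }
  int32_t Dims(int i) const { return _raw[1 + i]; }

private:
  const int32_t *_raw; // not owned; lifetime is tied to the tensor
};
```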

krayzemli commented 3 years ago

Since we know that the cpu backend always gets backend::cpu_common::Tensor as its input, we could make a static cast for each tensor first, then get a constant reference to its _info field via a special Tensor-specific method, and use it directly. Unifying ir::Shape and cker::Shape could reduce the conversion overhead even further.
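
Roughly, something like the following sketch (the stub types and the tensor_info() accessor are illustrative stand-ins, not the actual onert declarations):

```cpp
#include <vector>

// Stub types standing in for onert's ITensor / cpu_common::Tensor; the real
// classes are larger, this only illustrates the downcast-and-cache idea.
struct Shape
{
  std::vector<int> dims;
};

struct TensorInfo
{
  Shape shape;
};

struct ITensor
{
  virtual ~ITensor() = default;
};

struct CpuTensor : ITensor
{
  // Tensor-specific, non-virtual accessor to the metadata (the _info field).
  const TensorInfo &tensor_info() const { return _info; }
  TensorInfo _info;
};

struct SomeCpuKernel
{
  void configure(ITensor *input)
  {
    // The cpu backend only ever receives its own tensor type, so one
    // static_cast in configure() replaces virtual shape getters in run().
    auto *t = static_cast<CpuTensor *>(input);
    _info = &t->tensor_info();
  }

  void run()
  {
    // Use *_info directly; no getTensorShape() conversion on the hot path.
  }

  const TensorInfo *_info = nullptr;
};
```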

I don't get the reason why our ITensor interface is so abstract, with an individual getter for everything, when we could just keep some general tensor descriptor structure, common to all tensor types, and return a constant reference to it with a single call, probably even a non-virtual one. That is the approach used in MNN, for example.

krayzemli commented 3 years ago

Besides, getTensorShape is not the only source of tensor-shape-specific preprocessing overhead. Since dynamic tensors are not common for neural networks, we'd better add some onTensorInfoChanged method in addition to configure() and do all the necessary preprocessing there, including caching the necessary tensor shapes. When the tensor shape, data type, layout, or something like that changes, some mechanism would trigger the onTensorInfoChanged call of the affected operator before calling run().
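
A rough sketch of that hook (the interface and onTensorInfoChanged are hypothetical, not existing onert API):

```cpp
// Hypothetical kernel interface with the proposed hook: shape-dependent state is
// (re)built only when the runtime signals that a tensor's info actually changed,
// instead of being re-derived on every run().
struct IKernelWithInfoHook
{
  virtual ~IKernelWithInfoHook() = default;
  virtual void configure() = 0;           // one-time setup, caches shapes/params
  virtual void onTensorInfoChanged() = 0; // called by the runtime when a shape,
                                          // dtype or layout of an operand changed
  virtual void run() = 0;                 // hot path, uses only the cached state
};
```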

wateret commented 3 years ago

@krayzemli

Since we know that the cpu backend always gets backend::cpu_common::Tensor as its input, we could make a static cast for each tensor first, then get a constant reference to its _info field via a special Tensor-specific method, and use it directly. Unifying ir::Shape and cker::Shape could reduce the conversion overhead even further.

Yes, that could be a way. But please also note that it gets IPortableTensor, not cpu_common::Tensor. This was introduced to support inter-backend uses of cpu-based backend tensors. But it could be changed.

I don't get the reason why our ITensor interface is so abstract, with an individual getter for everything,

That's all because we separate backends as plugins. If we had the cpu backend only, it would have been much simpler. We first started this project bound to ARM ComputeLibrary as the primary backend, then extended it to support other kernels (cpu), so we introduced the abstract ITensor. It looks just like ComputeLibrary's ITensor.

when we could just keep some general tensor descriptor structure, common to all tensor types, and return a constant reference to it with a single call, probably even a non-virtual one. That is the approach used in MNN, for example.

That's somewhat what I meant by https://github.com/Samsung/ONE/issues/4544#issuecomment-705304811, but keeping the current structure. It would be great to do that with structural changes too, as long as it keeps the backend generality.

If you would like to talk about something specific, you may talk to me directly as well. 😄

wateret commented 3 years ago

Since dynamic tensors are not common for neural networks,

I respectfully disagree. I think these days dynamic tensors are getting popular. And we are working on many dynamic tensor models.

we'd better add some onTensorInfoChanged method in addition to configure() and do all the necessary preprocessing there, including caching the necessary tensor shapes. When the tensor shape, data type, layout, or something like that changes, some mechanism would trigger the onTensorInfoChanged call of the affected operator before calling run()

I'm not sure I get it right; I left some comments:

krayzemli commented 3 years ago

If I understand the dynamic tensor logic right, all dynamic-shape input tensors should return true on a call to is_dynamic. Hence, it is possible to call is_dynamic for every input and update the corresponding getTensorShape cache only when it returns true. Moreover, the static shape inferer always sets the dynamic flag of the op's output tensor once any of the op's input tensors are dynamic. So it may be enough to check only that the output tensor returns false on is_dynamic to be sure that all the input tensors are also static. Also, it is possible to assume that once a tensor has been set to dynamic, it will never be static again. Static tensor shapes are also already known when configure() is run.

I'm going to try that approach in binary ops: once the output's is_dynamic returns false in run(), I will reuse the results of getTensorShape and ProcessBroadcastShapes previously cached in configure(). What do you think? Would that be the correct logic?
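
A minimal sketch of that caching logic, with simplified stand-in types (not the actual onert binary-op layer code):

```cpp
// Sketch: cache shape-derived data in configure() and skip the recomputation
// in run() as long as the output tensor stays static.
#include <vector>

struct Shape
{
  std::vector<int> dims;
};

struct BroadcastParams
{
  // the result of a ProcessBroadcastShapes-like analysis would be stored here
};

struct Tensor
{
  bool is_dynamic() const { return _dynamic; }
  const Shape &shape() const { return _shape; }
  bool _dynamic = false;
  Shape _shape;
};

struct BinaryOpLayer
{
  void configure(const Tensor *lhs, const Tensor *rhs, const Tensor *out)
  {
    _lhs = lhs;
    _rhs = rhs;
    _out = out;
    recomputeCache();
  }

  void run()
  {
    // If the output is dynamic, an input shape may have changed since the last
    // run, so the cached shapes / broadcast params cannot be trusted.
    if (_out->is_dynamic())
      recomputeCache();
    // ... invoke the kernel using _lhs_shape, _rhs_shape and _params ...
  }

private:
  void recomputeCache()
  {
    _lhs_shape = _lhs->shape();
    _rhs_shape = _rhs->shape();
    // _params = ProcessBroadcastShapes(_lhs_shape, _rhs_shape);  // analogous step
  }

  const Tensor *_lhs = nullptr, *_rhs = nullptr, *_out = nullptr;
  Shape _lhs_shape, _rhs_shape;
  BroadcastParams _params;
};
```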

glistening commented 3 years ago

Also, it is possible to assume that once a tensor has been set to dynamic, it will never be static again. Static tensor shapes are also already known when configure() is run.

cc @hyunsik-yoon

@krayzemli I am not sure it is always right. We sometimes run the same model with different inputs.

I'm going to try that approach in binary ops: once the output's is_dynamic returns false in run(), I will reuse the results of getTensorShape and ProcessBroadcastShapes previously cached in configure(). What do you think? Would that be the correct logic?

I was also thinking of caching ProcessBroadcastShapes's return value (a boolean) in BinaryArithmeticOpParam after #4549 lands. If a given model is proven to be static-shape-only, we don't need to repeat the same things.

I will reuse the results of getTensorShape and ProcessBroadcastShapes previously cached in configure()

Please see BinaryArithmeticOpParam in compute/cker/include/cker/Types.h. As I understand, broadcast shapes are already cached in broadcast_shape[5].
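
Paraphrased, the relevant part of that params struct looks roughly like this (field names other than broadcast_shape are recalled from the TFLite-style params and may differ; this is not the verbatim contents of Types.h):

```cpp
// Approximate sketch of cker's BinaryArithmeticOpParam (paraphrased, not the
// verbatim contents of compute/cker/include/cker/Types.h; most fields omitted).
struct BinaryArithmeticOpParam
{
  // ... activation / quantization fields omitted ...

  // Filled by ProcessBroadcastShapes(): which broadcast strategy applies and
  // the pre-flattened shapes that drive it, so run() does not re-derive them.
  int broadcast_category; // an enum in the real code
  int broadcast_shape[5];
};
```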

hyunsik-yoon commented 3 years ago

@krayzemli,

For models where the app does not call nnfw_set_input_tensorinfo(..), the caching approach will work. In such cases, the input shape is always the same and every tensor's is_dynamic() returns false.

For models with dynamic tensors, there are some tricky cases:

```
// assume that a model's tflite input shape is [1, 100]

// run A) with input shape [1, 100] --> in this case, input_tensor.is_dynamic() == false
nnfw_prepare();
nnfw_set_input(); nnfw_set_output();
nnfw_run(); // run with input shape [1, 100]

// run B) with input shape [2, 100] --> in this case, input_tensor.is_dynamic() == true
nnfw_prepare();
nnfw_set_input_tensorinfo([2,100]);
nnfw_set_input(); nnfw_set_output();
nnfw_run();

// run C) with input shape [2, 100] --> in this case, input_tensor.is_dynamic() == true
..
// run Z) with input shape [1, 100] --> in this case, input_tensor.is_dynamic() == false
..
```

In this case, during runs B) and C), the hit ratio of the tensor shape cache built at run A) will be 0%. Then in run Z) the ratio will be 100%. Since it is a cache, if we add new shapes for run B), the hit ratio of runs C) and Z) will be 100%, but the size of the cache will grow. So we need some strategy to maintain the cache.
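
One possible strategy, purely as an illustration, is a single-entry cache keyed on the last seen input shape, so the cache never grows and a shape switch simply repopulates the slot:

```cpp
#include <vector>

// Illustrative single-entry shape cache: it keeps only the most recently seen
// input shape plus whatever was derived from it, so switching between shapes
// (runs A/B/C/Z above) just repopulates one slot instead of growing a map.
struct ShapeCache
{
  bool valid = false;
  std::vector<int> key_shape; // the input shape the cached data was derived from

  template <typename Recompute>
  void ensure(const std::vector<int> &current_shape, Recompute &&recompute)
  {
    if (!valid || current_shape != key_shape)
    {
      key_shape = current_shape;
      recompute(); // re-derive cached shapes / broadcast params for this shape
      valid = true;
    }
  }
};
```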

Another case to consider is the WHILE loop. The input shape of an op inside a WHILE loop can change over iterations. So the cached information would also need to include, e.g., the iteration count.

You mention,

Also, it is possible to assume that once a tensor has been set to dynamic, it will never be static again. Static tensor shapes are also already known when configure() is run.

I am not sure if I understand you correctly, but maybe there is a case of run C) -> run Z) in the above example.

I'm going to try that approach in binary ops: once the output's is_dynamic returns false in run(), I will reuse the results of getTensorShape and ProcessBroadcastShapes previously cached in configure(). What do you think? Would that be the correct logic?

If a model always runs without dynamic tensors, I think you're correct. However, the situation becomes complex once dynamic tensors are considered.

For optimization, we may need some API or environment variable to rule out such complex dynamic-tensor situations, e.g., export ALWAYS_STATIC_TENSOR=1.

wateret commented 3 years ago

@krayzemli I get what you are trying to do, and I would be grateful for that. However, I am not sure it would work out.

By the way, the thing that you mentioned:

Since we know that the cpu backend always gets backend::cpu_common::Tensor as its input, we could make a static cast for each tensor first, then get a constant reference to its _info field

Rather than the cache work, wouldn't this solve almost everything? I think with that there would be almost no overhead, so we wouldn't need the cache stuff.

wateret commented 3 years ago

@krayzemli @hyunsik-yoon Back to the cache work, is it possible to check whether the cache has been invalidated or not? I'm not sure if is_dynamic can be used that way. It is also used for pool-based (static) / new-delete (dynamic) memory management.

krayzemli commented 3 years ago

@wateret

I'm not sure if is_dynamic can be used that way.

is_dynamic can't be used to track shape changes for now, but still (unless I'm missing something) it can be used to detect shapes that have never changed since configure(), so at least we can benefit from caching static shapes only and invalidating the cache once they have changed. And it covers the case of the you-know-what network with an oversized graph. I've got more than a 5% performance speedup with caching and will submit a PR soon.

Rather than the cache work, wouldn't this solve almost everything? I think with that there would be almost no overhead, so we wouldn't need the cache stuff.

Tensor shape may not be the only thing we would want to cache. For example, ProcessBroadcastShapes fills a structure we would want to cache as well.

krayzemli commented 3 years ago

PR #4611

glistening commented 3 years ago

| PR   | base  | #4614 | #4611 + #4614 | #4611 |
| ---- | ----- | ----- | ------------- | ----- |
| xRT  | 0.867 | 0.846 | 0.806         | 0.821 |
| diff | 0     | 0.021 | 0.061         | 0.046 |

I would like to apply both of those PRs, since their changes are exclusive of each other (they don't overlap).

wateret commented 3 years ago

is_dynamic can't be used to track shape changes for now, but still (unless I'm missing something) it can be used to detect shapes that have never changed since configure(), so at least we can benefit from caching static shapes only and invalidating the cache once they have changed. And it covers the case of the you-know-what network with an oversized graph. I've got more than a 5% performance speedup with caching and will submit a PR soon.

Good point. Sounds reasonable. 😄

Rather than the cache work, wouldn't this solve almost everything? I think with that there would be almost no overhead, so we wouldn't need the cache stuff.

Tensor shape may not be the only thing we would want to cache. For example, ProcessBroadcastShapes fills a structure we would want to cache as well.

I get what you meant. However, as I'm more interested in dynamic models, I still think it would be worthwhile doing this, since your work is focused on static models only. And tensor shapes are something that most operation kernels are interested in.

hyunsik-yoon commented 3 years ago

In https://github.com/Samsung/ONE/issues/4544#issuecomment-707460292, run A) and run Z) will be changed to use dynamic input tensors. Please refer to https://github.com/Samsung/ONE/issues/4625 (still in discussion).

This means that once nnfw_set_input_tensorinfo() is called, subsequent calls to nnfw_run() will always be dynamic.

krayzemli commented 3 years ago

Since each tensor is typically used twice (once as an output and once as an input), caching the buffer() and getTensorShape() calls at the output point and reusing them at the input points in the same op sequence would save about half of the virtual calls. One way to do this is to wrap the abstract IPortableTensor into a CPU-backend-specific wrapper with a non-virtual interface. Intermediate tensors whose lifetime is limited to the op sequence could be completely non-abstract in such a case.
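
A rough sketch of such a wrapper (stand-in types and names; the real IPortableTensor interface is larger and differs in detail):

```cpp
#include <cstdint>
#include <vector>

// Stand-ins for the onert types, just to show the wrapping idea.
struct Shape
{
  std::vector<int32_t> dims;
};

struct IPortableTensor
{
  virtual ~IPortableTensor() = default;
  virtual uint8_t *buffer() const = 0;
  virtual Shape getShape() const = 0;
};

// Non-virtual, cpu-backend-local view of a tensor: the virtual calls are made
// once (e.g. when the producing op prepares the tensor) and the cached results
// are reused by every consuming op in the same op sequence.
class CachedTensorView
{
public:
  explicit CachedTensorView(const IPortableTensor &t)
    : _buffer(t.buffer()), _shape(t.getShape())
  {
  }

  uint8_t *buffer() const { return _buffer; }   // no virtual dispatch
  const Shape &shape() const { return _shape; } // no per-use conversion

private:
  uint8_t *_buffer;
  Shape _shape;
};
```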