PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Any way to check whether a DenseTensor's memory can be reused or not inside a PHI kernel? #53844

Closed · YangQun1 closed this 4 months ago

YangQun1 commented 1 year ago

Please ask your question

Inside a PHI kernel, we may want to reuse the memory of an input DenseTensor to improve performance or reduce the memory footprint. For example, the memory of the src tensor below can be reused by the dst tensor (in-place ReLU).

Op1 -> src -> ReLU -> dst -> Op2

But if the input tensor is used by multiple consumers, its memory can't be reused until the last consumer has been computed. For example, the memory of the src tensor below can't be reused by dst1, because the ReLU2 kernel has not been executed yet and still relies on the data in the src tensor.

Op1 -> src -> ReLU1 -> dst1 -> Op2
          \-> ReLU2 -> dst2 -> Op3

So, I would like to ask: is there any way to check whether a DenseTensor's memory can be reused or not inside a PHI kernel?
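
For concreteness, a minimal dygraph sketch of the hazard (the values and the second consumer are made up for illustration; the real case is inside static-graph PHI kernels, but the data dependence is the same):

```python
import paddle

src = paddle.to_tensor([-1.0, 2.0, -3.0])

dst1 = paddle.nn.functional.relu_(src)  # in-place ReLU: src's buffer now holds [0., 2., 0.]
dst2 = src * 2.0                        # a second consumer that still needs the original src
print(dst2.numpy())                     # [0. 4. 0.] rather than the intended [-2. 4. -6.]
```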

xinyu-intel commented 1 year ago

@yaomichael @weishengying @YuanRisheng Can you please help on this issue?

YuanRisheng commented 1 year ago

We don't support this inside PHI kernels. I think this is a scheduling problem of the Operators in the Program. Maybe you can create a temp tensor cloned from src for ReLU2?

YangQun1 commented 1 year ago

Yeah, cloning src for ReLU2 is a solution, but then the question is how to know whether I need to clone src or not. If we always clone src, there will be many extra memory copies, which hurts performance.

YuanRisheng commented 1 year ago

You can clone your src in Python code when you construct your network, and this should not add many extra memory copies unless there are many nodes like ReLU2 in your network.
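
A short sketch of that workaround (shapes and ops are made up for illustration): clone src only for the extra consumer, so the other branch's input is free to be reused.

```python
import paddle

x = paddle.rand([4, 16])
src = paddle.matmul(x, paddle.rand([16, 16]))   # stands in for Op1's output

dst1 = paddle.nn.functional.relu(src)           # branch whose input may later be reused in place

src_copy = src.clone()                          # pay for a copy only where src has more than one consumer
dst2 = paddle.nn.functional.relu(src_copy)      # the second consumer reads the clone instead of src
```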

YangQun1 commented 1 year ago

If so, users may need to change their network implementation to make proper use of the inplace optimization. Is inplace an attribute of the op or of the kernel? In my understanding, one op may have multiple kernels; will different kernels have different inplace support?

YuanRisheng commented 1 year ago

If you use dygraph, I think you need to change your code and switch to the inplace version of the API.
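
The original screenshot is not preserved here; as a small sketch, the dygraph in-place variants are the trailing-underscore APIs, e.g. relu_ in place of relu:

```python
import paddle

x = paddle.to_tensor([-1.0, 2.0, -3.0])

y = paddle.nn.functional.relu(x)   # out-of-place: allocates a new output tensor
paddle.nn.functional.relu_(x)      # in-place counterpart: writes the result into x's own buffer
```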

If you use static graph, @From00 may answer your question

xinyu-intel commented 1 year ago

@YangQun1 @YuanRisheng ReLU is just a simple case. Actually, this issue arises not in the user's graph but in the execution graph optimized by several analysis passes, just like the following case:

prv_op -> tensor0 -> FusedOp -> tensor1 ->
                  \-> Op1 -> tensor2 ->

When implementing FusedOp, how can we check whether tensor0 will also be used by Op1? Based on this information, we can decide whether tensor1 can safely share its buffer with tensor0.

@YangQun1 Probably we can add a new analysis pass after the fusion pass to record the consumer info in the FusedOp.
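
A rough Python sketch of that idea, assuming the consumer count is gathered by simply walking the ops of the (already optimized) Program; the ops below only stand in for prv_op / FusedOp / Op1, and this is not an existing Paddle pass:

```python
import collections
import paddle

paddle.enable_static()

def count_consumers(program):
    """Count how many ops read each variable; a kernel could only reuse an
    input buffer whose variable has exactly one consumer."""
    consumers = collections.defaultdict(int)
    for block in program.blocks:
        for op in block.ops:
            for name in op.input_arg_names:
                consumers[name] += 1
    return consumers

main = paddle.static.Program()
with paddle.static.program_guard(main):
    x = paddle.static.data("x", shape=[4, 16], dtype="float32")
    src = paddle.nn.functional.relu(x)                 # stands in for prv_op's output (tensor0)
    out1 = paddle.matmul(src, paddle.ones([16, 16]))   # stands in for FusedOp
    out2 = paddle.scale(src, scale=2.0)                # stands in for Op1

print(count_consumers(main)[src.name])  # 2 -> FusedOp must not blindly reuse src's buffer
```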

From00 commented 1 year ago

@YangQun1 @xinyu-intel Since you mentioned the execution graph optimized by passes, I think you are asking about the inplace mechanism in static mode. In static mode, inplace is also an optimization pass, which analyzes the memory-reuse information for the entire network and inserts a share_buffer Op between each in-out tensor pair that requires inplace. The share_buffer Op makes two tensors share the same memory, thereby achieving inplace. For more detail, see buffer_shared_inplace_op_pass. When implementing a Fused Op, you just need to register an inplace-inferer to tell which in-out tensor pairs can share the same memory, and buffer_shared_inplace_op_pass will take care of the rest. For an example of an inplace-inferer, see the Reshape Op.
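
The inplace-inferer registration itself lives on the C++ side of the op definition, so it is not shown here; as a rough Python-side sketch (assuming BuildStrategy.enable_inplace is what switches the buffer-sharing inplace pass on for a compiled static-graph program):

```python
import numpy as np
import paddle

paddle.enable_static()

main = paddle.static.Program()
startup = paddle.static.Program()
with paddle.static.program_guard(main, startup):
    x = paddle.static.data("x", shape=[4, 16], dtype="float32")
    y = paddle.nn.functional.relu(paddle.nn.functional.relu(x))  # a chain the pass can make in-place

build_strategy = paddle.static.BuildStrategy()
build_strategy.enable_inplace = True  # let the inplace pass analyze the whole graph
compiled = paddle.static.CompiledProgram(main, build_strategy=build_strategy)

exe = paddle.static.Executor(paddle.CPUPlace())
exe.run(startup)
out, = exe.run(compiled,
               feed={"x": np.random.rand(4, 16).astype("float32")},
               fetch_list=[y])
```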

paddle-bot[bot] commented 4 months ago

Since you haven't replied for more than a year, we have closed this issue/PR. If the problem is not solved or there is a follow-up question, please reopen it at any time and we will continue to follow up.