slyubomirsky opened 1 year ago
@tqchen A question about implementing one of the relatively simple cases, an in-place operation where the result is smaller than the input. I discussed with @MasterJH5574 and we weren't entirely sure about how this should work.
Let's suppose we have a call `out = call_tir_inplace(some_func, (t1,), return_shape)`, where `return_shape` is smaller (in at least one dimension) than `t1`. `out` will be treated as a tensor of the smaller shape (`return_shape`), but will in reality be stored in the same place as `t1`. @MasterJH5574 points out that slicing is an example where this could arise.
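As a rough analogy (using NumPy purely for illustration, not Relax itself), a sliced view reports a smaller shape while sharing storage with the original tensor, which is essentially the situation described above:

```python
import numpy as np

# t1 is the "input" tensor; out is a smaller result stored in the same buffer,
# analogous to the in-place slicing case described above.
t1 = np.arange(16, dtype="float32").reshape(4, 4)
out = t1[:2, :2]  # smaller shape, but no new allocation

assert out.shape == (2, 2)
assert np.shares_memory(out, t1)  # same underlying storage as t1
```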
The question is, do we need to have any special handling in the memory planner for this case? Would Relax's runtime treat `out` as being of `t1`'s shape even though it's supposed to be smaller? Where might we need to make changes to handle this case? I could imagine some difficulties potentially arising with strides, layout, etc.
(For now, I will implement the scenario where the output shape matches the input shape exactly.)
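One concrete instance of the stride concern (again sketched in NumPy as an analogy): a view with the smaller shape inherits the row stride of the original buffer, which differs from the compact strides a freshly allocated tensor of that shape would have:

```python
import numpy as np

full = np.zeros((4, 4), dtype="float32")
view = full[:, :2]  # smaller shape laid out over the original storage
fresh = np.zeros((4, 2), dtype="float32")  # compact allocation of the same shape

assert view.shape == fresh.shape == (4, 2)
assert view.strides == (16, 4)   # row stride of the original 4-wide float32 buffer
assert fresh.strides == (8, 4)   # compact row stride for a width-2 buffer
```

So any component that assumes compact layout for a tensor of shape `(4, 2)` would misread the in-place result.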
Alternative approach suggested by @tqchen and @psrivas2: Consider dataflow blocks only. This would have the advantage of avoiding a whole-program analysis for liveness and aliases and would be a large simplification due to not having to handle control flow, but the risk would be that the alias analysis would have to be overly conservative since any value that comes from outside the dataflow block would have to be treated as potentially an alias. This is worth trying on a real example (say, an excerpt from an LLM). If these very conservative versions of the analysis turn out to be sufficient, then that would be a reasonable starting point.
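The conservative rule described above can be sketched in a few lines (hypothetical helper names, not TVM APIs): only values bound inside the dataflow block can be proven alias-free, and everything arriving from outside must be assumed to alias.

```python
# Hypothetical sketch of the conservative aliasing rule: any value defined
# outside the dataflow block is treated as a possible alias, so only values
# created inside the block can be proven fresh.
def maybe_aliased(value: str, block_defined: set) -> bool:
    """Return True unless `value` is provably a fresh binding in this block."""
    return value not in block_defined

block_defined = {"lv0", "lv1"}  # bindings created inside the dataflow block
assert maybe_aliased("weight", block_defined)   # from outside: assume aliased
assert not maybe_aliased("lv0", block_defined)  # defined inside: known fresh
```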
The original proposal has been implemented in #16129, though focusing mainly on dataflow blocks instead. I am also now working on an addition to give special handling to `split` and `concat` (when these are eligible to be done in-place, they can be implemented as "no-ops" simply by taking views of the underlying storage).
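As an illustrative analogy only, NumPy's `split` already behaves this way: the resulting pieces are views of the original buffer, so the operation performs no copy.

```python
import numpy as np

a = np.arange(8)
left, right = np.split(a, 2)

# Both halves are views into a's storage: the "split" is effectively a no-op.
assert np.shares_memory(left, a) and np.shares_memory(right, a)
left[0] = 100
assert a[0] == 100  # writes through the view are visible in the original
```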
We can use the more complex approach of #15689 to implement a more general version that does not require dataflow blocks.
Per the discussion on in-place updates, this is a tracking issue to discuss the steps and implementation details.

- Implement the `call_tir_inplace` operator. This will handle the "simple case" described in the discussion thread, where the input tensor must be at least large enough to hold the desired output. At this stage, we will not handle memory planning for the cases where the input tensor is too small to hold the output (which would require the memory planner to ensure that the underlying storage is large enough).
- Automatically detect eligible `call_tir_inplace` invocations. At this stage, the memory planner should also be modified to handle cases where the input tensor needs a larger underlying storage.

cc @quic-sanirudh
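The "simple case" eligibility condition from the first step boils down to a byte-size comparison between output and input. A minimal sketch (a hypothetical helper, not the actual implementation):

```python
import numpy as np

def fits_in_place(in_shape, out_shape, dtype="float32"):
    """Return True if an output of out_shape can reuse the storage of an
    input of in_shape (the "simple case"); larger outputs would need the
    memory planner to allocate bigger underlying storage."""
    itemsize = np.dtype(dtype).itemsize
    out_bytes = itemsize * int(np.prod(out_shape))
    in_bytes = itemsize * int(np.prod(in_shape))
    return out_bytes <= in_bytes

assert fits_in_place((4, 4), (2, 2))      # smaller output: eligible
assert not fits_in_place((2, 2), (4, 4))  # larger output: needs planner support
```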