apache / tvm

Open deep learning compiler stack for cpu, gpu and specialized accelerators
https://tvm.apache.org/
Apache License 2.0

[Unity][Tracking Issue] In-place operations #15319

Open slyubomirsky opened 1 year ago

slyubomirsky commented 1 year ago

Per the discussion on in-place updates, this is a tracking issue to discuss the steps and implementation details.

cc @quic-sanirudh

slyubomirsky commented 1 year ago

@tqchen A question about implementing one of the relatively simple cases: an in-place operation where the result is smaller than the input. I discussed this with @MasterJH5574 and we weren't entirely sure how it should work.

Let's suppose we have a call `out = call_tir_inplace(some_func, (t1,), return_shape)`, where `return_shape` is smaller (in at least one dimension) than `t1`. `out` will be treated as a tensor of the smaller shape (`return_shape`), but will in reality be stored in the same place as `t1`. @MasterJH5574 points out that slicing is an example where this could arise.
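To make the scenario concrete, here is a minimal plain-Python sketch of a result that is smaller than its input but lives in the same storage. It is not TVM code: `slice_inplace` is a hypothetical stand-in for `some_func`, and a `memoryview` over an `array` stands in for a tensor sharing `t1`'s buffer.

```python
from array import array


def slice_inplace(storage, length):
    """Hypothetical in-place "slice": return the first `length` elements
    of `storage` as a view, without allocating or copying."""
    return memoryview(storage)[:length]


t1 = array("f", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # input with 6 elements
out = slice_inplace(t1, 3)                        # result has only 3 elements

# The result is smaller than the input but is backed by the same storage,
# so a write through `out` is visible through `t1`.
out[0] = 42.0
```

After the write, `t1[0]` holds the new value as well, which is exactly why `t1` must not be used again once `out` has been produced in place.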

The question is, do we need to have any special handling in the memory planner for this case? Would Relax's runtime treat `out` as being of `t1`'s shape even though it's supposed to be smaller? Where might we need to make changes to handle this case? I could imagine some difficulties potentially arising with stride, layout, etc.
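As an illustration of the stride concern, the sketch below compares two ways a (2, 3) result could occupy the storage of a (4, 6) input, assuming row-major layout. The helper names are hypothetical and this says nothing about what Relax's runtime actually does; it only shows that the two interpretations touch different elements.

```python
def row_major_strides(shape):
    """Element strides of a densely packed row-major tensor."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_offsets(shape, strides):
    """Flat storage offsets touched by a tensor with this shape and strides."""
    offsets = [0]
    for dim, stride in zip(shape, strides):
        offsets = [base + i * stride for base in offsets for i in range(dim)]
    return offsets


t1_shape = (4, 6)
out_shape = (2, 3)

# Option A: the result is repacked densely at the start of t1's storage.
compact = flat_offsets(out_shape, row_major_strides(out_shape))

# Option B: the result is a window into t1 and keeps t1's strides.
view = flat_offsets(out_shape, row_major_strides(t1_shape))
```

Here `compact` covers offsets 0 through 5, while `view` covers 0, 1, 2, 6, 7, 8. A consumer that assumes a compact layout would read the wrong elements from a result produced as a strided window, so the two cases need to be distinguished somewhere.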

(For now, I will implement the scenario where the output shape matches the input shape exactly.)

slyubomirsky commented 11 months ago

Alternative approach suggested by @tqchen and @psrivas2: consider dataflow blocks only. This avoids a whole-program analysis for liveness and aliases and would be a large simplification, since there is no control flow to handle. The risk is that the alias analysis would have to be overly conservative: any value that comes from outside the dataflow block would have to be treated as a potential alias. This is worth trying on a real example (say, an excerpt from an LLM). If this very conservative version of the analysis turns out to be sufficient, it would be a reasonable starting point.
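A rough sketch of what such a block-local analysis could look like, in plain Python rather than Relax. Everything here is an assumption made for illustration: the binding representation, the `VIEW_OPS` names, and the eligibility rules. It is not the analysis implemented in TVM.

```python
VIEW_OPS = {"reshape", "view"}  # hypothetical ops whose result aliases their input


def inplace_candidates(bindings, block_inputs, block_outputs):
    """bindings: ordered list of (var, op, args) in one dataflow block.
    Returns (var, arg) pairs where `var` could reuse the storage of `arg`."""
    # Map every variable to the root of its alias group.
    root = {v: v for v in block_inputs}
    for var, op, args in bindings:
        root[var] = root[args[0]] if op in VIEW_OPS else var

    external = set(block_inputs)                 # may alias anything outside
    live_out = {root[v] for v in block_outputs}  # still live after the block

    # Last binding index at which any member of an alias group is read.
    last_use = {}
    for i, (_, _, args) in enumerate(bindings):
        for a in args:
            last_use[root[a]] = i

    candidates = []
    for i, (var, op, args) in enumerate(bindings):
        if op in VIEW_OPS:
            continue  # views allocate nothing, so there is nothing to reuse
        for a in args:
            r = root[a]
            if r in external or r in live_out or last_use[r] != i:
                continue  # conservatively refuse
            if sum(root[o] == r for o in args) > 1:
                continue  # the same storage is read through another argument
            candidates.append((var, a))
            break
    return candidates


block = [
    ("a", "add", ("x", "y")),    # x and y come from outside the block
    ("b", "reshape", ("a",)),    # b aliases a
    ("c", "exp", ("a",)),        # a is still read later through b
    ("d", "mul", ("c", "b")),    # last use of c
]
result = inplace_candidates(block, block_inputs=["x", "y"], block_outputs=["d"])
```

On this toy block the only candidate is `d` reusing `c`. The `add` is rejected purely because `x` and `y` are block inputs, which is the conservatism discussed above: even if neither is used again anywhere in the program, a block-local analysis cannot know that.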

slyubomirsky commented 7 months ago

The original proposal has been implemented in #16129, though it focuses mainly on dataflow blocks. I am also now working on adding special handling for split and concat: when these are eligible to be done in place, they can be implemented as "no-ops" just by taking views of the underlying storage.
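The "no-op" idea for split and concat can be illustrated with standard-library views. This is a one-dimensional sketch only, with a hypothetical `split_views` helper; it does not reflect how the handling is implemented in TVM.

```python
from array import array

storage = array("f", [float(i) for i in range(8)])
whole = memoryview(storage)


def split_views(buf, sections):
    """Hypothetical no-op split: return equal-sized views, copying nothing."""
    size = len(buf) // sections
    return [buf[i * size:(i + 1) * size] for i in range(sections)]


left, right = split_views(whole, 2)

# Both halves share the original storage, so a write through one half
# is visible in the underlying buffer.
right[0] = -1.0

# Concatenating two adjacent views in their original order is likewise
# a no-op: the enclosing view `whole` already is the concatenation.
```

The concat case only works as a view when the pieces already sit next to each other in one buffer in the right order, which is part of what "eligible" has to mean here.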

We can use the more complex approach of #15689 to implement a more general version that does not require dataflow blocks.