Open samnordmann opened 9 months ago
cc @wujingyue . Jingyue has been thinking about this as well for his work on segmentation, with respect to decoupling allocations from specific kernels, though that is not related to multi-device usage. It would be good if everyone is aligned!
Yeah, meta-op-only fusions go through the same unnecessary fusion-to-kernel compilation. Fortunately, it's less wasteful because I changed expr_sort.cpp to not put alias-computing Exprs into the kernel IR.
In this multi-device PR, https://github.com/NVIDIA/Fuser/pull/1244 , we need to allocate some intermediate tensor buffers. The only simple way to do that right now is as follows:
Copying and compiling the Fusion is wasteful here. We should refactor FusionExecutor and expose a utility like "allocTensors" that does not require compiling the Fusion.
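At its core, such a utility only needs the tensors' shape and dtype metadata to size and allocate the buffers; no kernel compilation is involved. A minimal standalone sketch of the idea, using hypothetical `TensorDesc` and `allocTensors` names (not nvFuser's actual API, which would hand back `at::Tensor`s rather than raw byte buffers):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical descriptor: everything needed to size an intermediate buffer.
struct TensorDesc {
  std::vector<int64_t> shape;
  int64_t dtype_size; // bytes per element
};

// Allocates one raw byte buffer per descriptor, purely from metadata --
// no Fusion copy, no kernel compilation.
std::vector<std::vector<uint8_t>> allocTensors(
    const std::vector<TensorDesc>& descs) {
  std::vector<std::vector<uint8_t>> buffers;
  buffers.reserve(descs.size());
  for (const auto& d : descs) {
    // Element count is the product of the extents.
    const int64_t numel = std::accumulate(
        d.shape.begin(), d.shape.end(), int64_t{1}, std::multiplies<>{});
    buffers.emplace_back(static_cast<size_t>(numel * d.dtype_size));
  }
  return buffers;
}
```

The point is that the allocation path depends only on this metadata, so it could be factored out of FusionExecutor and reused by the multi-device code without the copy-and-compile step.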