Open samnordmann opened 9 months ago
cc @wujingyue . Jingyue has been thinking about this as well for his work on segmentation, with respect to decoupling allocations from specific kernels, though that is not related to multi-device usage. It would be good if everyone is aligned!
Yeah, meta-op-only fusions go through the same unnecessary fusion-to-kernel compilation. Fortunately, it's less wasteful because I changed expr_sort.cpp to not put alias-computing Exprs into the kernel IR.
In this multi-device PR, https://github.com/NVIDIA/Fuser/pull/1244 , we need to allocate some intermediate tensor buffers. The only simple way to do that right now is as follows:
Copying and compiling the Fusion is wasteful here. We should refactor FusionExecutor and expose a utility like "allocTensors" that does not require compiling the Fusion.
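At its core, such a utility only needs the tensors' shape and dtype metadata to size and allocate the buffers; no kernel compilation is involved. A minimal standalone sketch of the idea, using hypothetical `TensorDesc` and `allocTensors` names (not nvFuser's actual API, which would hand back `at::Tensor`s rather than raw byte buffers):

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical descriptor: everything needed to size an intermediate buffer.
struct TensorDesc {
  std::vector<int64_t> shape;
  int64_t dtype_size; // bytes per element
};

// Allocates one raw byte buffer per descriptor, purely from metadata --
// no Fusion copy, no kernel compilation.
std::vector<std::vector<uint8_t>> allocTensors(
    const std::vector<TensorDesc>& descs) {
  std::vector<std::vector<uint8_t>> buffers;
  buffers.reserve(descs.size());
  for (const auto& d : descs) {
    // Element count is the product of the extents.
    const int64_t numel = std::accumulate(
        d.shape.begin(), d.shape.end(), int64_t{1}, std::multiplies<>{});
    buffers.emplace_back(static_cast<size_t>(numel * d.dtype_size));
  }
  return buffers;
}
```

The point is that the allocation path depends only on this metadata, so it could be factored out of FusionExecutor and reused by the multi-device code without the copy-and-compile step.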