NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Allow FusionExecutorCache to take preallocated outputs #2247

Open samnordmann opened 1 month ago

samnordmann commented 1 month ago

This patch allows the user to also pass outputs to FusionExecutorCache::runFusionWithInput. Before the patch, the outputs were necessarily allocated by the executor.

The core capability is already supported by FusionExecutor. This PR only plumbs the outputs from FusionExecutorCache down to FusionExecutor.

samnordmann commented 1 month ago

!build

naoyam commented 4 weeks ago

Please add tests.

Please make sure to have segmented tests. I'm not sure whether a given vector of outputs for a fusion is properly passed to each segment.

samnordmann commented 2 weeks ago

> LGTM! Is this so that the host can control when buffers are allocated?

Yes! To have better control over allocation and to be able to reuse pre-allocated buffers across different fusion executions.

samnordmann commented 2 weeks ago

> Please add tests.
>
> Please make sure to have segmented tests. I'm not sure whether a given vector of outputs for a fusion is properly passed to each segment.

Makes sense, that's what I did. Let me know what you think

samnordmann commented 2 weeks ago

!build

csarofeen commented 4 days ago

@jjsjann123 @wujingyue I wonder if this type of functionality could be useful for fusions when we have concat. I was wondering whether it would be beneficial to artificially do a "concat" by allocating a contiguous concat tensor and passing portions of it as non-contiguous tensors to the fusion. This way the concat is done by allocation + strided writes, rather than our current pad/masks approach.

wujingyue commented 4 days ago

Taking preallocated outputs is useful for FusionExecutor, which I believe is already implemented but exercised only in unit tests. For example, what you just said is the "unzip+segment" approach in https://github.com/NVIDIA/Fuser/issues/1502, which requires FusionExecutor to take pre-allocated outputs. In addition, we'll probably need this if we want to call cublas/cudnn from a fusion because these math libraries don't allocate outputs by themselves.

However, I'm uncertain about FusionExecutorCache at this point, which orchestrates the execution of the complete fusion. It could be useful if Thunder wants nvFuser to output to a particular buffer allocated by the downstream executor.

jjsjann123 commented 3 days ago

> This way the concat is done by allocation + strided writes, rather than our current pad/masks approach.

Agreed, that's a cleaner end result for codegen. But it adds complexity to the infrastructure code around it. My naive question here is: is pad considered tricky to handle in codegen, hence we try to avoid it?

samnordmann commented 3 days ago

> artificially do a "concat" by allocation a contiguous concat tensor and passing portions of non-contiguous tensors to the fusion. This way the concat is done by allocation + strided writes,

This was exactly my motivation

> Taking preallocated outputs is useful for FusionExecutor, which I believe is already implemented but exercised only in unit tests.

Pre-allocating some but not all of the output buffers is not supported by FusionExecutor; however, this PR is a critical piece to enable that feature, as Naoya pointed out in https://github.com/NVIDIA/Fuser/pull/2247#discussion_r1626422131

Another missing piece is the cache ID when output buffers are given: https://github.com/NVIDIA/Fuser/pull/2247#discussion_r1618809114