Open samnordmann opened 1 month ago
!build
Please add tests.
Please make sure to have segmented tests. I'm not sure if a given list of vectors for a fusion is properly passed to each segment.
LGTM! Is this so that the host can control when buffers are allocated?
Yes! To have better control over allocation, and to be able to reuse pre-allocated buffers across different fusion executions.
> Please add tests.
> Please make sure to have segmented tests. I'm not sure if a given list of vectors for a fusion is properly passed to each segment.
Makes sense, that's what I did. Let me know what you think.
!build
@jjsjann123 @wujingyue I wonder if this type of functionality could be useful for fusions when we have concat. I was wondering whether it would be beneficial to do a "concat" artificially by allocating a contiguous concat tensor and passing non-contiguous portions of it to the fusion. This way the concat is done by allocation + strided writes, rather than our current pad/mask approach.
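For the record, here is a minimal sketch of the "concat by allocation" idea, illustrated with NumPy rather than nvFuser (the buffer names and shapes are made up for illustration):

```python
import numpy as np

# Idea: instead of materializing pad/mask ops for a concat, pre-allocate one
# contiguous output tensor up front, and have each producer write into a
# non-contiguous (strided) view of it.

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.arange(8, dtype=np.float32).reshape(2, 4)

# Pre-allocate the contiguous "concat" tensor.
out = np.empty((2, 3 + 4), dtype=np.float32)

# Each producer receives a strided slice of `out` and writes in place;
# no separate pad or mask step is needed.
out[:, :3] = a  # strided write into the first region
out[:, 3:] = b  # strided write into the second region

assert np.array_equal(out, np.concatenate([a, b], axis=1))
```

In nvFuser terms, the two slices would be the pre-allocated output buffers handed to the fusion, which is why this feature depends on accepting user-provided outputs.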
Taking preallocated outputs is useful for FusionExecutor, which I believe is already implemented but exercised only in unit tests. For example, what you just said is the "unzip+segment" approach in https://github.com/NVIDIA/Fuser/issues/1502, which requires FusionExecutor to take pre-allocated outputs. In addition, we'll probably need this if we want to call cublas/cudnn from a fusion because these math libraries don't allocate outputs by themselves.
However, I'm uncertain about FusionExecutorCache at this point, which orchestrates the execution of the complete fusion. It could be useful if Thunder wants nvFuser to output to a particular buffer allocated by the downstream executor.
> This way the concat is done by allocation + strided writes, rather than our current pad/masks approach.
Agree that's a cleaner end result for codegen, but it adds complexity to the infrastructure code around it. My naive question here is: is `pad` considered tricky to handle in codegen, hence we try to avoid it?
> artificially do a "concat" by allocation a contiguous concat tensor and passing portions of non-contiguous tensors to the fusion. This way the concat is done by allocation + strided writes,
This was exactly my motivation.
> Taking preallocated outputs is useful for FusionExecutor, which I believe is already implemented but exercised only in unit tests.
Pre-allocating some but not all of the output buffers is not supported by `FusionExecutor`; however, this is a critical piece needed to enable that feature, as pointed out by Naoya in https://github.com/NVIDIA/Fuser/pull/2247#discussion_r1626422131.

Another missing piece is the cache ID when output buffers are given: https://github.com/NVIDIA/Fuser/pull/2247#discussion_r1618809114
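To make the motivation concrete, here is a hedged NumPy analogue of the pattern this PR enables: the caller allocates the output buffer once and reuses it across executions, with NumPy's `out=` parameter standing in for an executor that accepts user-provided output storage (the loop and values are illustrative only):

```python
import numpy as np

# The "host" allocates the output buffer once, outside the execution loop.
buf = np.empty(4, dtype=np.float32)

# Each "fusion execution" writes its result into the same pre-allocated
# buffer instead of allocating a fresh output every time.
for step in range(3):
    x = np.full(4, step, dtype=np.float32)
    np.add(x, 1.0, out=buf)  # out= plays the role of a pre-allocated output

assert np.array_equal(buf, np.full(4, 3.0, dtype=np.float32))
```

The point is the ownership split: allocation is controlled by the caller, while the executor only fills in the values, which is what passing outputs down to `FusionExecutor` achieves.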
> This patch allows the user to also pass outputs to `FusionExecutorCache::runFusionWithInput`. Before the patch, the outputs would necessarily be allocated by the executor. The core capability is already supported by `FusionExecutor`; this PR only implements plumbing the outputs from `FusionExecutorCache` down to `FusionExecutor`.