Open zasdfgbnm opened 1 year ago
call for review: @naoyam @csarofeen @mmigdal-nv @drzejan2
The multi-device runtime by @samnordmann seems quite relevant. He's extending the overall design. Maybe we should also consider finer-grained task parallelism like warp specialization.
I would be happy to discuss it. I'll post some design docs here when they are ready.
In the case of matmul using TensorCores, a horizontal group of tasks would effectively mean partitioning the input tensors into sub-regions, and within each such task there would be a vertical group with two tasks:
- handling data loading (TMA)
- processing data with the TensorCore?

Or is this too low level, and for now by "task" do we mean a higher-level operation, not low-level memory management?
You are right, it is to parallelize TMA and tensor core.
Motivation
On Hopper, efficient gemm requires warp specialization, which is not currently supported by nvFuser. This doc proposes extending nvFuser to support this optimization. I believe this will benefit not only matmul, but also other cases like optimal cat/stack scheduling, horizontal fusion, etc.; see the "Potential applications" section for more detail.
Design
Notation: I will mostly use the term "task parallelism" for the new capability being added to nvFuser. "Warp specialization" is a special case of "vertical task parallelism" (described below) applied to the thread index.
Partition of DAG as tasks
In order to use task parallelism, we first need to partition the DAG into tasks. Tasks are non-overlapping and dense, that is, every `Val` in the fusion definition except fusion inputs belongs to a task (fusion inputs are special because they are given instead of computed), and one `Val` can only belong to one task. Initially, all `Val`s belong to task 0. Example partition:

Grouping tasks into a hierarchical structure
Tasks are further grouped into task groups. Task groups form a hierarchical structure.
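To make the partition and grouping concrete, here is a minimal sketch using nvFuser's C++ fusion-definition API (`makeSymbolicTensor` is the helper used in the C++ tests). The fusion, the tensor names, and the task/group assignments in the comments are my own illustration, not the `tv0`-`tv10` example from the figures; nvFuser does not currently expose an API for declaring tasks, so the partition lives only in the comments.

```cpp
// Minimal sketch: a cat of two independently computed branches.
// Task and group assignments are hypothetical and exist only as comments.
Fusion fusion;
FusionGuard fg(&fusion);

TensorView* in0 = makeSymbolicTensor(2); // fusion input: belongs to no task
TensorView* in1 = makeSymbolicTensor(2); // fusion input: belongs to no task
fusion.addInput(in0);
fusion.addInput(in1);

TensorView* a0 = sin(in0);    // task 1
TensorView* a1 = add(a0, a0); // task 1
TensorView* b0 = cos(in1);    // task 2
TensorView* b1 = mul(b0, b0); // task 2
TensorView* out = cat({a1, b1}, /*dim=*/1); // task 3
fusion.addOutput(out);

// One possible hierarchical grouping (conceptual):
//   group A = {task 1, task 2}  -- members are mutually independent
//   group B = {group A, task 3} -- group A produces the inputs of task 3
```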
Parallelization of task groups
A task group can be parallelized, for example
Not all task groups can be parallelized. A parallelizable group is either a "horizontal group" or a "vertical group". A "horizontal group" is a group whose members have no data dependencies on each other; for example, group 1 is a horizontal group. A "vertical group" is a group whose members are connected by data dependencies; for example, group 3 is a vertical group.
Below is an example where group 4 is neither a horizontal group nor a vertical group:
However, you can make it a horizontal group by grouping group 2 and group 3 together:
Expression sorting
Expression sorting must be task and task group aware. For the above example, the sorted expressions can be
which zoom into
which zoom into
which zoom into
Loop nest generation
When generating the loop nest, for an unparallelized group, we just generate its members one after another. For parallelized groups, we generate `kir::IfThenElse`s to dispatch between its members.

Assuming `tv0`-`tv8` have `[BIDx, TIDy{size0}, TIDx]`, `tv9` and `tv10` have `[BIDx, TIDy{size0*4}, TIDx]` (the cat dim is 1, and I am assuming the cat dim is untouched by the scheduler in this example), and we inline most, the generated loop nest structure will be
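The listing of that loop nest structure is not reproduced above, so below is a rough CUDA-level sketch of the kind of structure meant, assuming the horizontal group has four members (consistent with the `TIDy` extent `size0*4`); the branch bodies are placeholders, not the actual generated code.

```cpp
// Reconstruction (not actual nvFuser output): the kir::IfThenElse dispatches
// the members of the horizontal group onto disjoint threadIdx.y ranges of
// extent size0 each; afterwards every thread computes tv9/tv10 over the full
// TIDy extent size0 * 4.
if (threadIdx.y < size0) {
  // loop nest for the first member (a subset of tv0-tv8), using threadIdx.y
} else if (threadIdx.y < 2 * size0) {
  // loop nest for the second member, using threadIdx.y - size0
} else if (threadIdx.y < 3 * size0) {
  // loop nest for the third member, using threadIdx.y - 2 * size0
} else {
  // loop nest for the fourth member, using threadIdx.y - 3 * size0
}
__syncthreads(); // block sync before the consumer group (see "Synchronization")
// loop nest for tv9 and tv10, parallelized on the full TIDy{size0 * 4}
```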
Synchronization
For the parallelization of horizontal task groups, synchronization must happen before and after the dispatch. Depending on the parallel type, a block sync or a grid sync might be needed. For the parallelization of vertical task groups (a.k.a. warp specialization), the parallelization boundary (in this case `tv5`-`tv8`) must be double/circular buffered, and an arrive-wait barrier is used for synchronization.

Potential applications
Efficient matmul on Hopper
See: https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized.hpp
Warp specialization is used, and the load and the mma+store are done in different warps.
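As an illustration of the runtime pattern only (not of nvFuser's generated code), the producer/consumer structure with circular buffering and arrive-wait synchronization can be written in plain CUDA with `cuda::pipeline`, which is built on arrive-wait barriers. The warp assignment, stage count, and tile size below are placeholders, and the real kernel would use TMA and mma instead of the generic copy/compute bodies.

```cpp
#include <cooperative_groups.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

constexpr int kStages = 2;      // circular-buffer depth (placeholder)
constexpr int kTileElems = 128; // elements per tile (placeholder)

__global__ void warp_specialized(const float* in, float* out, int num_tiles) {
  __shared__ float smem[kStages][kTileElems];
  // Shared state backing the arrive-wait barriers used by the pipeline.
  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, kStages> state;

  auto block = cg::this_thread_block();
  // Warp specialization: warp 0 loads (producer), the other warps compute (consumer).
  const bool is_producer = (threadIdx.x / 32 == 0);
  const auto role =
      is_producer ? cuda::pipeline_role::producer : cuda::pipeline_role::consumer;
  auto pipe = cuda::make_pipeline(block, &state, role);

  if (is_producer) {
    const int lane = threadIdx.x % 32;
    const int chunk = kTileElems / 32; // each producer thread copies one chunk
    for (int tile = 0; tile < num_tiles; ++tile) {
      pipe.producer_acquire(); // wait until this stage's buffer is free
      cuda::memcpy_async(&smem[tile % kStages][lane * chunk],
                         in + tile * kTileElems + lane * chunk,
                         sizeof(float) * chunk, pipe);
      pipe.producer_commit();  // signal "stage full" to the consumers
    }
  } else {
    for (int tile = 0; tile < num_tiles; ++tile) {
      pipe.consumer_wait();    // wait for "stage full"
      // ... compute on smem[tile % kStages] and write results to out ...
      pipe.consumer_release(); // signal "stage free" to the producer
    }
  }
}
```

In nvFuser the analogous structure would instead be produced by the lowering described above: a `kir::IfThenElse` on the warp index, with the circular-buffered boundary tensors synchronized by arrive-wait barriers.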
Horizontal fusion
For example, suppose we have multiple separate fusions, independently scheduled. If all fusions use only `BIDx` but not `BIDy` and `BIDz`, then we can trivially horizontally fuse these fusions by partitioning each fusion as a task in the combined fusion and horizontally parallelizing these tasks on `BIDx`.
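A minimal sketch of what the combined kernel's top-level dispatch could look like (the grid split and the fusion bodies are placeholders):

```cpp
// Hypothetical sketch: two independently scheduled fusions packed into one grid.
// Fusion A originally used grid_a blocks on BIDx; the combined kernel is
// launched with grid_a + grid_b blocks.
__global__ void horizontally_fused(int grid_a /*, inputs/outputs of both fusions */) {
  if (blockIdx.x < grid_a) {
    // body of fusion A, using blockIdx.x directly
  } else {
    // body of fusion B, using blockIdx.x - grid_a as its BIDx
  }
}
```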
Cat/stack schedule
For cat/stack, the size of the output tensor is naturally the sum of the sizes of the inputs, so we could parallelize the computation of the inputs in the same way as the parallelization of group 1 in the above example.