NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Translate segments to python definition #3335

Closed: rdspring1 closed this 6 days ago

rdspring1 commented 2 weeks ago

Overview:

Changes in this PR

This PR implements the buildSegment function for user-scheduler segmentation. It is the second PR in a stack, preceded by https://github.com/NVIDIA/Fuser/pull/3334 and followed by https://github.com/NVIDIA/Fuser/pull/3025.

  1. Implement the buildSegment function in csrc/python_frontend/segmentation.cpp.
  2. Complete the segment function in nvfuser/__init__.py.

Example:

Original Fusion: A reduction + broadcast + pointwise fusion.

def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1],
                          contiguity=[True, True],
                          dtype=DataType.Float,
                          is_cpu=False)
    T1 = fd.define_tensor(shape=[-1, -1],
                          contiguity=[True, True],
                          dtype=DataType.Float,
                          is_cpu=False)
    T2 = fd.ops.sum(T0, dims=[1], keepdim=False, dtype=DataType.Float)
    T3 = fd.ops.broadcast(T2, is_broadcast_dim=[False, True])
    T4 = fd.ops.add(T1, T3)
    fd.add_output(T4)
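
For readers who want to check the arithmetic, the fusion above can be mirrored in plain Python on nested lists. This is only a reference sketch of the math (row-wise sum, broadcast to a column, pointwise add); it does not use nvFuser, and the name `original_fusion` and the sample inputs are made up for illustration.

```python
def original_fusion(t0, t1):
    """Plain-Python reference for sum(dims=[1]) -> broadcast -> add."""
    # T2: reduce each row of t0 to a single value (sum over dim 1).
    t2 = [sum(row) for row in t0]
    # T3: broadcast the reduction result into a [rows, 1] column.
    t3 = [[value] for value in t2]
    # T4: pointwise add; the single entry in each t3 row is reused
    # for every column of the matching t1 row.
    return [[x + t3[i][0] for x in row] for i, row in enumerate(t1)]


# Hypothetical sample inputs, chosen only to make the result easy to verify.
t0 = [[1.0, 2.0], [3.0, 4.0]]
t1 = [[10.0, 20.0], [30.0, 40.0]]
result = original_fusion(t0, t1)
print(result)  # [[13.0, 23.0], [37.0, 47.0]]
```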

After Segmentation: The reduction scheduler does not support fusing any operations with an inner reduction, so the original fusion is divided into two segments.

First Segment:

The first segment contains the reduction and broadcast operations, which correspond to [T0, T2, T3] in the original fusion. Only the segment's input and output are mapped, so the map from segment index to original index has two entries.

| Segment Index | Original Index | Description |
| --- | --- | --- |
| T0 | T0 | The first tensor argument for the original fusion. |
| T2 | T3 | The broadcasted reduction tensor is this segment's output. |

def nvfuser_fusion_id2(fd : FusionDefinition) -> None :
   T0 = fd.define_tensor(shape=[-1, -1],
                         contiguity=[True, True],
                         dtype=DataType.Float,
                         is_cpu=False)
   T1 = fd.ops.sum(T0, dims=[1], keepdim=False, dtype=DataType.Float)
   T2 = fd.ops.broadcast(T1, is_broadcast_dim=[False, True])
   fd.add_output(T2)

Second Segment:

The second segment is the pointwise addition with the broadcasted reduction. It corresponds to [T1, T3, T4] in the original fusion.

| Segment Index | Original Index | Description |
| --- | --- | --- |
| T0 | T1 | The second tensor argument for the original fusion. |
| T1 | T3 | The broadcasted reduction tensor, which is the output from the first segment. |
| T2 | T4 | The pointwise addition, which is the output for the original fusion. |

def nvfuser_fusion_id3(fd : FusionDefinition) -> None :
   T0 = fd.define_tensor(shape=[-1, -1],
                         contiguity=[True, True],
                         dtype=DataType.Float,
                         is_cpu=False)
   T1 = fd.define_tensor(shape=[-1, 1],
                         contiguity=[True, None],
                         dtype=DataType.Float,
                         is_cpu=False)
   T2 = fd.ops.add(T0, T1)
   fd.add_output(T2)
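
To make the role of the two index maps concrete, here is a minimal sketch in plain Python. It assumes the maps can be modeled as dictionaries from segment tensor index to original tensor index, with the entries copied from the two tables above. The names `FIRST_SEGMENT_MAP`, `SECOND_SEGMENT_MAP`, `first_segment`, `second_segment`, and `run_segments` are hypothetical and are not part of the nvFuser API; the segments are emulated on nested lists rather than executed by nvFuser.

```python
# Segment index -> original index, copied from the two tables above.
FIRST_SEGMENT_MAP = {0: 0, 2: 3}
SECOND_SEGMENT_MAP = {0: 1, 1: 3, 2: 4}


def first_segment(t0):
    """Emulates the first segment: row-wise sum, then broadcast to a column."""
    t1 = [sum(row) for row in t0]
    t2 = [[value] for value in t1]
    return t2


def second_segment(t0, t1):
    """Emulates the second segment: pointwise add with a [rows, 1] column."""
    return [[x + t1[i][0] for x in row] for i, row in enumerate(t0)]


def run_segments(orig_t0, orig_t1):
    # Workspace keyed by ORIGINAL tensor index, so segments can exchange data.
    workspace = {0: orig_t0, 1: orig_t1}

    # First segment: its T0 reads original T0; its T2 is stored as original T3.
    out = first_segment(workspace[FIRST_SEGMENT_MAP[0]])
    workspace[FIRST_SEGMENT_MAP[2]] = out

    # Second segment: its T0 and T1 read original T1 and T3;
    # its T2 is stored as original T4, the output of the whole fusion.
    out = second_segment(workspace[SECOND_SEGMENT_MAP[0]],
                         workspace[SECOND_SEGMENT_MAP[1]])
    workspace[SECOND_SEGMENT_MAP[2]] = out
    return workspace[4]


# Hypothetical sample inputs.
final = run_segments([[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]])
print(final)  # [[13.0, 23.0], [37.0, 47.0]]
```

Original index T3 appears in both maps, which is how the output of the first segment becomes an input of the second.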

Priya2698 commented 1 week ago

I am seeing changes from PR #3334. Can you rebase so that this PR only includes its own changes, for easier review?

jjsjann123 commented 1 week ago

Oops. I think the merge of #3334 messed up the git history. You might have to resolve the conflicts by hand now.

rdspring1 commented 1 week ago

I used git rebase to fix the conflicts.

rdspring1 commented 1 week ago

!test

rdspring1 commented 1 week ago

I renamed some variables to make things clearer. I hope it helps!!!