Translate segments to python definition

rdspring1 commented 2 weeks ago

Overview:

buildSegment creates the CPP Fusion for a given segment id, translates it to a python FusionDefinition, then returns a mapping from the segment fusion state indices to the original fusion state indices.
FusionDefinition.segment calls setupSegmentation, buildSegment, and finalizeSegmentation to create python definitions for the sub-fusions and their index mappings.

Changes in this PR

This PR implements the buildSegment function for user-scheduler segmentation. It is the second PR in a stack, preceded by https://github.com/NVIDIA/Fuser/pull/3334 and followed by https://github.com/NVIDIA/Fuser/pull/3025.

Implement the buildSegment function in csrc/python_frontend/segmentation.cpp.
Complete the segment function in nvfuser/__init__.py.

Example:

Original Fusion: A reduction + broadcast + pointwise fusion.

def nvfuser_fusion_id1(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1],
                          contiguity=[True, True],
                          dtype=DataType.Float,
                          is_cpu=False)
    T1 = fd.define_tensor(shape=[-1, -1],
                          contiguity=[True, True],
                          dtype=DataType.Float,
                          is_cpu=False)
    T2 = fd.ops.sum(T0, dims=[1], keepdim=False, dtype=DataType.Float)
    T3 = fd.ops.broadcast(T2, is_broadcast_dim=[False, True])
    T4 = fd.ops.add(T1, T3)
    fd.add_output(T4)
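
For reference, the arithmetic of this fusion can be sketched in plain Python without nvFuser. This is only an illustration of the math: tensors are modeled as nested lists, and the name reference_fusion is hypothetical, not part of the nvFuser API.

```python
# Plain-Python reference for the example fusion (no nvFuser required).
# Tensors are modeled as nested lists; reference_fusion is a hypothetical name.

def reference_fusion(t0, t1):
    # T2 = sum(T0, dims=[1]): reduce each row to a single value, shape [N].
    t2 = [sum(row) for row in t0]
    # T3 = broadcast(T2, is_broadcast_dim=[False, True]): shape [N] -> [N, 1].
    t3 = [[value] for value in t2]
    # T4 = add(T1, T3): the size-1 inner dimension is expanded across each row.
    t4 = [[x + t3_row[0] for x in t1_row] for t1_row, t3_row in zip(t1, t3)]
    return t4


t0 = [[1.0, 2.0], [3.0, 4.0]]
t1 = [[10.0, 20.0], [30.0, 40.0]]
result = reference_fusion(t0, t1)
print(result)  # [[13.0, 23.0], [37.0, 47.0]]
```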

After Segmentation: The reduction scheduler does not support fusing any operations with an inner reduction, so the original fusion is divided into two segments.

First Segment:

The first segment contains the reduction and broadcast operations, which correspond to [T0, T2, T3] in the original fusion. The map below lists only the segment's input and output, so the segment index to original index map has two entries.

Segment Index	Original Index	Description
T0	T0	The first tensor argument for the original fusion.
T2	T3	The broadcasted reduction tensor, which is this segment's output.

def nvfuser_fusion_id2(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1],
                          contiguity=[True, True],
                          dtype=DataType.Float,
                          is_cpu=False)
    T1 = fd.ops.sum(T0, dims=[1], keepdim=False, dtype=DataType.Float)
    T2 = fd.ops.broadcast(T1, is_broadcast_dim=[False, True])
    fd.add_output(T2)
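
One way to read the table for this segment: number the segment's tensors in definition order, then keep only the entries that cross the segment boundary (inputs and outputs). The sketch below reproduces both tables in this description under that assumption. It is not how buildSegment computes the map, and build_index_map is a hypothetical helper.

```python
# Hypothetical helper: reproduces the index-map tables from this description.
# Assumption: the map keeps only tensors on the segment boundary
# (segment inputs and outputs) and drops intermediates.

def build_index_map(original_indices, boundary):
    """original_indices: original tensor names in segment definition order.
    boundary: original tensor names that are segment inputs or outputs."""
    full = {f"T{i}": orig for i, orig in enumerate(original_indices)}
    return {seg: orig for seg, orig in full.items() if orig in boundary}


# First segment: original T2 (the sum) is an intermediate, so it is dropped.
first_map = build_index_map(["T0", "T2", "T3"], boundary={"T0", "T3"})
print(first_map)  # {'T0': 'T0', 'T2': 'T3'}

# Second segment: every tensor is an input or an output, so all three remain.
second_map = build_index_map(["T1", "T3", "T4"], boundary={"T1", "T3", "T4"})
print(second_map)  # {'T0': 'T1', 'T1': 'T3', 'T2': 'T4'}
```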

Second Segment:

The second segment is the pointwise addition with the broadcasted reduction. It corresponds to [T1, T3, T4] in the original fusion.

Segment Index	Original Index	Description
T0	T1	The second tensor argument for the original fusion.
T1	T3	The broadcasted reduction tensor, which is the output of the first segment.
T2	T4	The pointwise addition, which is the output for the original fusion.

def nvfuser_fusion_id3(fd : FusionDefinition) -> None :
    T0 = fd.define_tensor(shape=[-1, -1],
                          contiguity=[True, True],
                          dtype=DataType.Float,
                          is_cpu=False)
    T1 = fd.define_tensor(shape=[-1, 1],
                          contiguity=[True, None],
                          dtype=DataType.Float,
                          is_cpu=False)
    T2 = fd.ops.add(T0, T1)
    fd.add_output(T2)
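
To illustrate how the two index maps tie the segments back to the original fusion, here is a toy model in plain Python. Each segment is a stand-in function operating on nested lists, and the maps are copied from the tables above. The names segment_1, segment_2, and run_segments are hypothetical, and this is a sketch of the idea, not the nvFuser execution path.

```python
# Toy model: run both segments and route values through the index maps.
# All function names are hypothetical; tensors are nested lists.

def segment_1(t0):
    # Sum over dim 1, then broadcast to shape [N, 1].
    return [[sum(row)] for row in t0]


def segment_2(t0, t1):
    # Pointwise add; t1 has shape [N, 1] and is expanded across each row.
    return [[x + b[0] for x in row] for row, b in zip(t0, t1)]


# Segment index -> original index, as listed in the tables above.
SEGMENT_1_MAP = {"T0": "T0", "T2": "T3"}
SEGMENT_2_MAP = {"T0": "T1", "T1": "T3", "T2": "T4"}


def run_segments(original_inputs):
    # state is keyed by ORIGINAL fusion indices.
    state = dict(original_inputs)
    # First segment: read its input, store its output under original T3.
    state[SEGMENT_1_MAP["T2"]] = segment_1(state[SEGMENT_1_MAP["T0"]])
    # Second segment: read original T1 and T3, store the result as original T4.
    state[SEGMENT_2_MAP["T2"]] = segment_2(
        state[SEGMENT_2_MAP["T0"]], state[SEGMENT_2_MAP["T1"]]
    )
    return state


state = run_segments({
    "T0": [[1.0, 2.0], [3.0, 4.0]],
    "T1": [[10.0, 20.0], [30.0, 40.0]],
})
print(state["T3"])  # [[3.0], [7.0]]
print(state["T4"])  # [[13.0, 23.0], [37.0, 47.0]]
```

Keying the shared state by original indices means each segment only needs its own map to find its inputs and publish its outputs.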

Priya2698 commented 1 week ago

I am seeing changes from PR #3334. Can you rebase to include only the changes from this PR for easier review?

jjsjann123 commented 1 week ago

Oops. I think the merge of #3334 messed up the git history. You might have to resolve the conflicts by hand now.

rdspring1 commented 1 week ago

I used git rebase to fix the conflicts.

rdspring1 commented 1 week ago

!test

rdspring1 commented 1 week ago

I renamed some variables to make things clearer. I hope it helps!!!
