NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Examine the broadcast size for aliased input/output when identifying RW race #3251

Open Priya2698 opened 3 weeks ago

Priya2698 commented 3 weeks ago

PR #2999 adds a presegmentation pass that forces segmentation when an in-place update can cause a RW race. When an intermediate tensorview is aliased to a fusion input, a RW race occurs if the intermediate tensorview or the aliased input is in the path of a broadcast.

This preseg pass currently does not consider whether the size of the broadcasted tv differs from that of the aliased input/output. When the sizes do not differ, segmentation may not be required.
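
To make the hazard concrete, here is a toy model in plain Python. It is not nvFuser code: the function names, the row-by-row ordering, and the choice of which row performs the write are all assumptions made for illustration. It models a single fused kernel that reads an input `x` through a broadcast and also updates `x` in place, and compares the result against a race-free reference.

```python
def fused_inplace_update(x, t):
    """Toy fused kernel: y[i][j] = x[j] + t[i][j], plus an in-place x[j] += 1.

    Rows are processed one after another to mimic one possible thread
    interleaving; row 0 also performs the in-place write to x.
    """
    m, n = len(t), len(x)
    y = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            y[i][j] = x[j] + t[i][j]  # read of the (possibly already updated) input
            if i == 0:
                x[j] = x[j] + 1.0     # write into the aliased input buffer
    return y


def reference(x, t):
    """Race-free result: every read sees the value of x from before any write."""
    return [[xj + tij for xj, tij in zip(x, row)] for row in t]


def has_hazard(m, n=3):
    """Return True if the toy fused kernel disagrees with the reference."""
    x = [float(j) for j in range(n)]
    t = [[10.0 * i for _ in range(n)] for i in range(m)]
    expected = reference(list(x), t)
    actual = fused_inplace_update(list(x), t)
    return actual != expected
```

In this model, `has_hazard(1)` is false because each `x[j]` is read exactly once before it is overwritten, while `has_hazard(4)` is true because rows 1 to 3 read values that row 0 has already updated. That is the distinction the pass could exploit: a broadcast that does not expand beyond the aliased size leaves no second reader to race with the write.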

Consider the test at https://github.com/NVIDIA/Fuser/blob/2ec2b926a0c589f7b72bd7a3abce7a49111c5620/tests/cpp/test_alias.cpp#L980-L1029. The fusion is segmented such that the RW race does not occur, and the broadcasted size is not different from that of the aliased input/output. However, this preseg pass still inserts a segment set and splits the fusion into 3 segments, even though 2 segments are functionally correct.

} // {Re-written complete fusion}
Segmented_Fusion Dump: -- fusion segments:
Segmented_Fusion{ 
groups: 
  reduction{0, 1, 2}
  pointwise{3, 4, 5}
edges: 
  e{ reduction{0, 1, 2} -> pointwise{3, 4, 5}(T5_g_float[ iS9{i2} ]) }

group details:
g{(reduction)
group id: 0
inputs:
  T0_g_float[ iS0{i0}, iS1{i2} ] float
  T1_g_float[ iS16{i2} ] float
outputs:
  T5_g_float[ iS9{i2} ] float

T3_l_float[ iS5{i0}, iS6{i2} ]
   = T0_g_float[ iS0{i0}, iS1{i2} ]
   + double(1);
(0)
T4_g_float[ rS7{i0}, iS8{i2} ]
   = reduction( T3_l_float[ iS5{i0}, iS6{i2} ], op = fmax, initial value = double(-inf), allreduce = false )
(1)
T5_g_float[ iS9{i2} ]
   = T4_g_float[ rS7{i0}, iS8{i2} ]
   + T1_g_float[ iS16{i2} ];
(2)
}

g{(pointwise)
group id: 1
inputs:
  T0_g_float[ iS0{i0}, iS1{i2} ] float
  T2_g_float[ iS3{i4}, iS17{i2} ] float
  T5_g_float[ iS9{i2} ] float
outputs:
  T7_g_float[ iS12{i4}, iS13{i2} ] float
  T8_g_float[ iS14{i4}, iS15{i2} ] float

T6_g_float[ bS10{1}, iS11{i2} ]
   = broadcast( T5_g_float[ iS9{i2} ] )
(3)
T7_g_float[ iS12{i4}, iS13{i2} ]
   = T6_g_float[ bS10{1}, iS11{i2} ]
   + T2_g_float[ iS3{i4}, iS17{i2} ];
(4)
T8_g_float[ iS14{i4}, iS15{i2} ]
   = T7_g_float[ iS12{i4}, iS13{i2} ]
   + double(1);
(5)
}
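
For readers less used to the dump format, the two groups above can be read as the following plain-Python sketch, with nested lists standing in for tensors. The function names are made up for illustration. The dump lists T0 as an input of the pointwise group although no expression there reads it, and it does not state which tensor is aliased to which, so the sketch leaves aliasing out entirely.

```python
def run_reduction_segment(t0, t1):
    """Group 0: T3 = T0 + 1, T4 = max over the first axis, T5 = T4 + T1."""
    t3 = [[v + 1.0 for v in row] for row in t0]   # expr (0), shape [i0, i2]
    t4 = [max(col) for col in zip(*t3)]           # expr (1), shape [i2]
    t5 = [a + b for a, b in zip(t4, t1)]          # expr (2), shape [i2]
    return t5


def run_pointwise_segment(t2, t5):
    """Group 1: T6 = broadcast(T5), T7 = T6 + T2, T8 = T7 + 1."""
    # expr (3) and (4): T5 is broadcast along the first axis of T2, shape [i4, i2]
    t7 = [[b + v for b, v in zip(t5, row)] for row in t2]
    # expr (5)
    t8 = [[v + 1.0 for v in row] for row in t7]
    return t7, t8
```

The only value crossing the segment boundary is T5, which matches the single edge in the dump.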

This issue is to track how the preseg pass needs to be modified to identify such cases.

kevinstephano commented 1 week ago

@Priya2698 can this issue be closed?

Priya2698 commented 1 week ago

No, I need to investigate this issue and understand whether any changes are required to the current implementation of the 'SegmentInplaceUpdate' preseg pass.