NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
Other
266 stars 53 forks source link

Concretization error from communication segment followed by compute segment. #2481

Closed cowanmeg closed 4 months ago

cowanmeg commented 4 months ago

I am seeing a concretization error where an IterDomain extent is not bound to a constant. This error only occurs when we have a reduction based collective (like allreduce/reduce scatter) followed by computation in another segment.

An example looks like this: ---- local matmul ---- T4_g[ ideviceIdx.x10{i0}, iS11{i3}, iS12{i6}, rS13{i2} ] (DeviceMesh{0 1}) = matmul(T3_l[ ideviceIdx.x7{i0}, iS9{i3}, iS8{i2} ] (DeviceMesh{0 1}), T1_g[ ideviceIdx.x3{i4}, iS4{i5}, iS5{i6} ] (DeviceMesh{0 1}))

---- allreduce --- (dispatched to ProcessGroup) T5_g[ rS14{i0}, iS15{i3}, iS16{i6} ] (DeviceMesh{0 1} = reduce([ideviceIdx.x3{i4}, iS4{i5}, iS5{i6} DeviceMesh{0 1}))

---- bias add --- Note: rS14{i0} is the Val that triggers the error. %kernel { T7_l[ iS19{i3}, iS20{i6} ] (DeviceMesh{0 1}) = half2float(T5_g[ rS14{i0}, iS15{i3}, iS16{i6} ] (DeviceMesh{0 1})); T6_l[ bS17{1}, iS18{i7} ] (DeviceMesh{0 1}) = broadcast( T2_l[ iS6{i7} ] (DeviceMesh{0 1}) ) T8_l[ bS21{1}, iS22{i7} ] (DeviceMesh{0 1}) = half2float(T6_l[ bS17{1}, iS18{i7} ] (DeviceMesh{0 1})); T9_g[ iS23{i3}, iS24{i6} ] (DeviceMesh{0 1}) = T7_l[ iS19{i3}, iS20{i6} ] (DeviceMesh{0 1})

Error: C++ exception with description "ext_opt.hasValue() INTERNAL ASSERT FAILED at "/csrc/dynamic_transform.cpp":279, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Could not evaluate dynamic extent: i0 Exception raised from DynamicTransformConcretizationInfo at csrc/dynamic_transform.cpp:279 (most recent call first):

The allreduce is handled outside of FusionExecutorCache, so I assume this is an issue of piping information about i0's extent to the bias add segment since it cannot be inferred from the segment inputs? If something similar was already implemented for compute segments that take an input with a reduced IterDomain a pointer to it would be greatly appreciated :)

wujingyue commented 4 months ago

Can you give me a reproducer? I guess it involves some local changes to your test, so I wasn't sure how.

cowanmeg commented 4 months ago

Yes, here is a minimal test to reproduce: https://github.com/NVIDIA/Fuser/commit/00d93fabf4f86409c4678e22983336506074a363

jacobhinkle commented 4 months ago

I haven't run the repro yet, but it looks like the issue is that we cannot determine the extent i0, which we need in order to tell whether that dimension is a broadcast dimension or not, or whether it is potentially size 0. I wonder how we should infer i0 based on inputs. Note that we will need to be able to do that even if we don't need to concretize that axis, because the symbolic value i0 can be used in other expressions: you could divide a tensor by it for example. If all we have is the "no-devices" aten tensor then we must have some other mechanism like the NCCL world size that we can bind to i0 in an ExpressionEvaluator.

cowanmeg commented 4 months ago

Sorry, should have clarified. Running just the matmul+allreduce works (in fact we have this test https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_multidevice_matmul.cpp#L274) this failing test just tags an add at the end. Some background - currently we infer i0 (or any DID axis in the inputs) by the DeviceMesh size (https://github.com/NVIDIA/Fuser/blob/main/csrc/expr_evaluator.cpp#L143).

The binding error only occurs on the bias add segment. IIUC in that segment, concretizing i0 isn't even necessary since it's reduced iter domain? Also, I bolded the wrong iter domain in the example which is now fixed!

jacobhinkle commented 4 months ago

Sorry, should have clarified. Running just the matmul+allreduce works (in fact we have this test https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_multidevice_matmul.cpp#L274) this failing test just tags an add at the end.

:+1:

Some background - currently we infer i0 (or any DID axis in the inputs) by the DeviceMesh size (https://github.com/NVIDIA/Fuser/blob/main/csrc/expr_evaluator.cpp#L143).

Great! I hadn't noticed that.

The binding error only occurs on the bias add segment. IIUC in that segment, concretizing i0 isn't even necessary since it's reduced iter domain? Also, I bolded the wrong iter domain in the example which is now fixed!

Ah! I think I understand now. Thank you for the explanation! One way to avoid errors like this is for us to strip off that annoying reduction domain for the segment input T5 when we do convertInputLogicalsToRoots (this could probably be renamed eraseRootsInInputs now). This also struck us here: https://github.com/NVIDIA/Fuser/issues/2317#issuecomment-2166339072. I'll give it a try soon.

cowanmeg commented 4 months ago

Yes, that is the exact same error! Thank you!!

cowanmeg commented 4 months ago

Verified the PR worked for this test! Thank you @jacobhinkle!