Closed cowanmeg closed 4 months ago
Can you give me a reproducer? I guess it involves some local changes to your test, so I wasn't sure how.
Yes, here is a minimal test to reproduce: https://github.com/NVIDIA/Fuser/commit/00d93fabf4f86409c4678e22983336506074a363
I haven't run the repro yet, but it looks like the issue is that we cannot determine the extent i0
, which we need in order to tell whether that dimension is a broadcast dimension or not, or whether it is potentially size 0. I wonder how we should infer i0
based on inputs. Note that we will need to be able to do that even if we don't need to concretize that axis, because the symbolic value i0
can be used in other expressions: you could divide a tensor by it for example. If all we have is the "no-devices" aten tensor then we must have some other mechanism like the NCCL world size that we can bind to i0
in an ExpressionEvaluator
.
Sorry, should have clarified. Running just the matmul+allreduce works (in fact we have this test https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_multidevice_matmul.cpp#L274) this failing test just tags an add at the end. Some background - currently we infer i0 (or any DID axis in the inputs) by the DeviceMesh size (https://github.com/NVIDIA/Fuser/blob/main/csrc/expr_evaluator.cpp#L143).
The binding error only occurs on the bias add segment. IIUC in that segment, concretizing i0 isn't even necessary since it's reduced iter domain? Also, I bolded the wrong iter domain in the example which is now fixed!
Sorry, should have clarified. Running just the matmul+allreduce works (in fact we have this test https://github.com/NVIDIA/Fuser/blob/main/tests/cpp/test_multidevice_matmul.cpp#L274) this failing test just tags an add at the end.
:+1:
Some background - currently we infer i0 (or any DID axis in the inputs) by the DeviceMesh size (https://github.com/NVIDIA/Fuser/blob/main/csrc/expr_evaluator.cpp#L143).
Great! I hadn't noticed that.
The binding error only occurs on the bias add segment. IIUC in that segment, concretizing i0 isn't even necessary since it's reduced iter domain? Also, I bolded the wrong iter domain in the example which is now fixed!
Ah! I think I understand now. Thank you for the explanation! One way to avoid errors like this is for us to strip off that annoying reduction domain for the segment input T5 when we do convertInputLogicalsToRoots
(this could probably be renamed eraseRootsInInputs
now). This also struck us here: https://github.com/NVIDIA/Fuser/issues/2317#issuecomment-2166339072. I'll give it a try soon.
Yes, that is the exact same error! Thank you!!
Verified the PR worked for this test! Thank you @jacobhinkle!
I am seeing a concretization error where an IterDomain extent is not bound to a constant. This error only occurs when we have a reduction based collective (like allreduce/reduce scatter) followed by computation in another segment.
An example looks like this: ---- local matmul ---- T4_g[ ideviceIdx.x10{i0}, iS11{i3}, iS12{i6}, rS13{i2} ] (DeviceMesh{0 1}) = matmul(T3_l[ ideviceIdx.x7{i0}, iS9{i3}, iS8{i2} ] (DeviceMesh{0 1}), T1_g[ ideviceIdx.x3{i4}, iS4{i5}, iS5{i6} ] (DeviceMesh{0 1}))
---- allreduce --- (dispatched to ProcessGroup) T5_g[ rS14{i0}, iS15{i3}, iS16{i6} ] (DeviceMesh{0 1} = reduce([ideviceIdx.x3{i4}, iS4{i5}, iS5{i6} DeviceMesh{0 1}))
---- bias add --- Note: rS14{i0} is the Val that triggers the error. %kernel { T7_l[ iS19{i3}, iS20{i6} ] (DeviceMesh{0 1}) = half2float(T5_g[ rS14{i0}, iS15{i3}, iS16{i6} ] (DeviceMesh{0 1})); T6_l[ bS17{1}, iS18{i7} ] (DeviceMesh{0 1}) = broadcast( T2_l[ iS6{i7} ] (DeviceMesh{0 1}) ) T8_l[ bS21{1}, iS22{i7} ] (DeviceMesh{0 1}) = half2float(T6_l[ bS17{1}, iS18{i7} ] (DeviceMesh{0 1})); T9_g[ iS23{i3}, iS24{i6} ] (DeviceMesh{0 1}) = T7_l[ iS19{i3}, iS20{i6} ] (DeviceMesh{0 1})
Error: C++ exception with description "ext_opt.hasValue() INTERNAL ASSERT FAILED at "/csrc/dynamic_transform.cpp":279, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Could not evaluate dynamic extent: i0 Exception raised from DynamicTransformConcretizationInfo at csrc/dynamic_transform.cpp:279 (most recent call first):
The allreduce is handled outside of FusionExecutorCache, so I assume this is an issue of piping information about i0's extent to the bias add segment since it cannot be inferred from the segment inputs? If something similar was already implemented for compute segments that take an input with a reduced IterDomain a pointer to it would be greatly appreciated :)