NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

nvFuser confuses sharded and unsharded dimension sizes #2758

Closed wujingyue closed 1 month ago

wujingyue commented 1 month ago

To reproduce:

git fetch origin wjy/matmul
git checkout wjy/matmul
mpirun -np 2 bin/test_multidevice --gtest_filter=*ReduceScatter_Add

https://github.com/NVIDIA/Fuser/pull/2635/files#diff-c8528013b87deea502aad00711c5310808bba8b463435d56b8cd5dd799854036R347 is the link to the test case. Below is a summary of the test:

  1. The input T0 has shape [DIDx{i0}, i1, i2].
  2. After the ReduceScatter, the shape (of T1) becomes [1, DIDx{i1}, i2].
  3. The add is supposed to add shape(T0)[1] == i1 to T1. In the test, i1 == 2; however, 1 was added instead.
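The steps above can be simulated outside nvFuser. The following is a minimal NumPy sketch (not nvFuser code; the variable names and the two-list "device" model are illustrative assumptions) showing the reduce-scatter followed by the add, and where the sharded extent (1) differs from the unsharded extent (i1 == 2):

```python
import numpy as np

# Hypothetical simulation of the test on 2 "devices".
i0, i1, i2 = 2, 2, 3                 # i0 is the device-parallel (DIDx) dimension
rng = np.random.default_rng(0)
full_t0 = rng.standard_normal((i0, i1, i2))

# Each device d holds shard T0_d of shape [1, i1, i2].
shards = [full_t0[d:d + 1] for d in range(i0)]

# ReduceScatter: sum over the DIDx axis, then scatter the result along i1,
# so each device ends up with a [1, 1, i2] shard of T1.
reduced = sum(shards)                               # shape [1, i1, i2]
t1_shards = [reduced[:, d:d + 1, :] for d in range(i1)]

# The add should use the *unsharded* extent shape(T0)[1] == i1 == 2 ...
expected = [s + i1 for s in t1_shards]
# ... but the reported bug binds the *sharded* extent of T1's DIDx{i1}
# dimension, which is 1, effectively computing s + 1 instead.
buggy = [s + 1 for s in t1_shards]
```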

This issue is related to https://github.com/NVIDIA/Fuser/issues/2755 but is even trickier. It can't be worked around by not binding the complete fusion's input, because in this case the complete fusion's input dimension shape(T0)[1] already holds the correct size.

wujingyue commented 1 month ago

Summarizing my offline discussion with @naoyam:

As far as we can see, the right fix is to always bind the original/unsharded size, which is available via device meshes. For places where the sharded size (i.e. 1) is needed, e.g. allocateOutputs, add special cases for DIDs.
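To make the proposed fix concrete, here is a hypothetical sketch of the two extents involved (the `Dim`, `bound_extent`, and `allocation_extent` names are invented for illustration and do not correspond to nvFuser's actual classes or functions):

```python
from dataclasses import dataclass

@dataclass
class Dim:
    extent: int         # original/unsharded extent, e.g. i1 == 2
    did_parallel: bool  # True if sharded across devices (DIDx)

def bound_extent(d: Dim) -> int:
    # Proposed fix: always bind the unsharded size for expression
    # evaluation, so arithmetic like "add shape(T0)[1]" sees i1 == 2.
    return d.extent

def allocation_extent(d: Dim) -> int:
    # Special case for DID dimensions in places like allocateOutputs:
    # each device only materializes a size-1 slice of a sharded dim.
    return 1 if d.did_parallel else d.extent
```

Under this scheme the two callers that today disagree would each get the extent they actually need.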

As a potential workaround, we talked about fixing #2755 so that the extent Val* maps to the same size within a fusion segment. That's not obviously easier than the right fix above, so I'll punt on the workaround but keep the door open to revisit it later.

wujingyue commented 1 month ago

cc @cowanmeg as I realized this is related to your https://github.com/NVIDIA/Fuser/pull/1932

wujingyue commented 1 month ago

FYI, @cowanmeg, 6577c8b12680c44ce11b61f358f6083a8dcf5bef tries to assign, for each DID dimension, the unsharded size as logical_size and 0 as logical_stride. This seems to fix the test case above. I'll run more tests to verify...
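For intuition on the unsharded-logical_size / stride-0 combination, the same idea can be illustrated with NumPy stride tricks (a loose analogy, not the commit's actual implementation): a per-device shard can be viewed with the full logical shape by giving the DID dimension stride 0, so every index along that dimension aliases the same physical data.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# A shard of physical shape [1, i1, i2] ...
i0, i1, i2 = 2, 2, 3
shard = np.arange(i1 * i2, dtype=np.float64).reshape(1, i1, i2)

# ... viewed with logical shape [i0, i1, i2] and stride 0 on the DID
# dimension: the logical size is unsharded, but no extra memory is used.
view = as_strided(shard,
                  shape=(i0, i1, i2),
                  strides=(0,) + shard.strides[1:])
```

With this metadata, shape queries on the DID dimension return the unsharded extent while the per-device allocation stays a single slice.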