Closed: wujingyue closed this issue 1 month ago.
Summarizing my offline discussion with @naoyam:
As far as we can see, the right fix is to always bind the original/unsharded size, which is available via device meshes. For places where the sharded size (i.e. 1) is needed, e.g. `allocateOutputs`, add special cases for DIDs.
As a potential workaround, we talked about fixing #2755 so that the extent `Val*` likely maps to the same size within a fusion segment. It's not obviously easier than the right fix above, so I'll punt on that workaround but keep the door open to revisit it later.
cc @cowanmeg as I realized this is related to your https://github.com/NVIDIA/Fuser/pull/1932
FYI @cowanmeg: 6577c8b12680c44ce11b61f358f6083a8dcf5bef tried to assign `logical_size` the unsharded size and `logical_stride` 0, for each DID dimension of course. This seems to fix the test case above. I'll run more tests to verify...
To reproduce: https://github.com/NVIDIA/Fuser/pull/2635/files#diff-c8528013b87deea502aad00711c5310808bba8b463435d56b8cd5dd799854036R347 links to the test case. Below is a summary of the test:
`T0` has shape `[DIDx{i0}, i1, i2]`. `T1` becomes `[1, DIDx{i1}, i2]`. `add` adds `shape(T0)[1]` (`== i1`) to `T1`. In the test, `i1 == 2`; however, 1 was added instead.

This issue is related to https://github.com/NVIDIA/Fuser/issues/2755 but is even trickier. It can't be worked around by not binding the complete fusion's input, because in this case the complete fusion's input `shape(T0)[1]` contains the right dimension size.