csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org

Sorting error observed in NVFuser #1319

Closed · kevinstephano closed this issue 2 years ago

kevinstephano commented 2 years ago

šŸ› Bug

Conversation from Horace:

hmmm... I think it has something to do with the previous sort error I mentioned.

Horace He  3 days ago
I'm writing a test minimizer, and that error started being thrown by nvfuser sometimes lol

Horace He  3 days ago
but only after the sorting error had already been thrown

Christian Sarofeen  3 days ago
Yeah makes sense.

Christian Sarofeen  3 days ago
The kernel is what's produced from lowering. Sorting failing means it wasn't lowered, so the kernel doesn't get populated.

Christian Sarofeen  3 days ago
We're working on the sorting bug. Not sure if progress has been made on it yet except for making a minimal example.
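
In other words, expression sorting is an early step of lowering, and code generation runs only after the sort succeeds; when sorting throws, the kernel is never populated, and any later attempt to use it raises a second, misleading error. A minimal sketch of that cascade, using hypothetical names rather than NVFuser's actual classes:

class SortError(RuntimeError):
    """Stand-in for the expression-sorting failure during lowering."""

class Fusion:
    def __init__(self, exprs, sortable=True):
        self.exprs = exprs
        self.sortable = sortable
        self.kernel = None  # only a successful lowering populates this

    def lower(self):
        # Expression sorting runs first; if it throws, codegen never runs.
        if not self.sortable:
            raise SortError("failed to sort expressions")
        self.kernel = lambda inputs: [f"ran {e}" for e in self.exprs]

    def run(self, inputs):
        if self.kernel is None:
            # The second, misleading error seen after the sort error.
            raise RuntimeError("kernel is not populated (lowering failed)")
        return self.kernel(inputs)

fusion = Fusion(["mul", "add"], sortable=False)
for step in (fusion.lower, lambda: fusion.run([])):
    try:
        step()
    except RuntimeError as e:
        print(type(e).__name__, "->", e)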

To Reproduce
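
# Graph dump (likely produced by AOTAutograd via the test minimizer); the
# trailing `; x = None` statements are FX codegen freeing intermediates early.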

def forward(self, primals_67, primals_68, addmm_18, slice_1, getitem_41, getitem_42, mul_25, neg_5):
    native_layer_norm_14 = torch.ops.aten.native_layer_norm(addmm_18, [128], primals_67, primals_68, 1e-05)
    getitem_40 = native_layer_norm_14[0];  native_layer_norm_14 = None
    sigmoid_16 = torch.ops.aten.sigmoid(getitem_40);  getitem_40 = None
    mul_26 = torch.ops.aten.mul(slice_1, sigmoid_16);  slice_1 = None
    mul_27 = torch.ops.aten.mul(sigmoid_16, neg_5);  sigmoid_16 = neg_5 = None
    mul_28 = torch.ops.aten.mul(mul_25, mul_27);  mul_25 = mul_27 = None
    native_layer_norm_backward = torch.ops.aten.native_layer_norm_backward(mul_28, addmm_18, [128], getitem_41, getitem_42, primals_67, primals_68, [True, True, True]);  mul_28 = addmm_18 = getitem_41 = getitem_42 = primals_67 = primals_68 = None
    getitem_55 = native_layer_norm_backward[0];  native_layer_norm_backward = None
    add_7 = torch.ops.aten.add(mul_26, getitem_55);  mul_26 = getitem_55 = None
    return ([add_7],)
# Input shapes, in argument order:
# [torch.Size([128]), torch.Size([128]), torch.Size([8192, 128]), torch.Size([8192, 128]),
#  torch.Size([8192, 1]), torch.Size([8192, 1]), torch.Size([8192, 128]), torch.Size([8192, 128])]
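
The snippet above is a graph dump rather than a standalone script. Below is a minimal eager-mode harness; it assumes the forward() above is defined in the same file and that the shape list gives the eight inputs in argument order. The issue does not show how the graph was routed through nvfuser (it came out of the test minimizer), so this only checks that the graph executes and is shape-consistent in eager mode.

import torch

class Repro(torch.nn.Module):
    pass

Repro.forward = forward  # attach the forward() from the dump above

shapes = [
    (128,), (128,), (8192, 128), (8192, 128),
    (8192, 1), (8192, 1), (8192, 128), (8192, 128),
]
device = "cuda" if torch.cuda.is_available() else "cpu"
args = [torch.randn(*s, device=device) for s in shapes]

out = Repro().to(device)(*args)
print([t.shape for t in out[0]])  # expect [torch.Size([8192, 128])]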
Chillee commented 2 years ago

Duplicate of https://github.com/csarofeen/pytorch/issues/1305, I believe.

csarofeen commented 2 years ago

This should be fixed as of https://github.com/csarofeen/pytorch/issues/1305. @kevinstephano or @Chillee, would you mind confirming?

csarofeen commented 2 years ago

Closing; please reopen if this is still an issue. CC @Chillee

Chillee commented 2 years ago

Yes, the two issues are duplicates (so this one is now also fixed).