dame-cell / Triformer

Transformers components but in Triton
https://pypi.org/project/triformer/
MIT License
27 stars · 0 forks

LayerNorm unit test fails in NGC Docker 24.10 environment #2

Open fanzhongyi opened 3 weeks ago

fanzhongyi commented 3 weeks ago

I am encountering an issue where the LayerNorm unit tests fail in the NGC Docker 24.10 environment. Specifically, the gradient-matching test between the Triton-based TritonLayerNorm and the standard torch.nn.LayerNorm does not pass: the gradients for the weight and bias parameters in the custom Triton-based LayerNorm implementation are not being calculated properly. The assertion error message is:

tests/test_layernorm.py:69: AssertionError
=========================== short test summary info ============================
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[1-128-256] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[8-512-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[16-256-512] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[4-1024-768] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[8-1024-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[16-1024-1024] - AssertionError: LayerNorm weight gradients don't match!
FAILED tests/test_layernorm.py::TestLayerNorm::test_backward_match[32-512-1024] - AssertionError: LayerNorm weight gradients don't match!
==================== 7 failed, 36 passed in 93.55s (0:01:33) ====================

dame-cell commented 3 weeks ago

Ah yes, I changed the layernorm kernel a few weeks ago; I might have to check the tests again.

Thank you so much for reporting. Will fix it by today.

dame-cell commented 3 weeks ago

Hey, I was able to fix it. The problem was that the backward kernel returned only the input gradients; the weight and bias gradients were no longer being output.

You can now run the tests and they should pass:
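For context, here is a NumPy sketch of what the backward has to produce (the real kernel is written in Triton; the function names here are just illustrative): alongside the input gradient, the weight and bias gradients must also be computed and returned.

```python
import numpy as np

def layernorm_fwd(x, w, b, eps=1e-5):
    # Row-wise normalization over the last dimension.
    mu = x.mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    x_hat = (x - mu) * rstd
    return x_hat * w + b, (x_hat, rstd)

def layernorm_bwd(dy, x_hat, rstd, w):
    # Weight/bias grads reduce over the rows; these are the two
    # outputs the backward kernel had stopped returning.
    dw = (dy * x_hat).sum(axis=0)
    db = dy.sum(axis=0)
    # Standard LayerNorm input-gradient formula.
    dxhat = dy * w
    dx = rstd * (dxhat
                 - dxhat.mean(axis=-1, keepdims=True)
                 - x_hat * (dxhat * x_hat).mean(axis=-1, keepdims=True))
    return dx, dw, db
```

A backward that returns only `dx` leaves `weight.grad` and `bias.grad` unset, which is exactly what the failing tests were catching.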

============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.3.3, pluggy-1.5.0
plugins: time-machine-2.14.1, typeguard-4.3.0, anyio-4.4.0
collected 16 items                                                             

test_layernorm.py ................                                       [100%]

============================== 16 passed in 8.63s ==============================
fanzhongyi commented 3 weeks ago

Thank you for your prompt response. I have tested the fix, but unfortunately I found a bug in the test code after the recent commit. The issue occurs in the following lines, where the assertion always passes even though it shouldn't. After I fixed this, the test still fails.

Additionally, I believe it would be beneficial to add a test case that explicitly compares the gradients of weight and bias between the Triton-based TritonLayerNorm and PyTorch's torch.nn.LayerNorm. This would ensure that the gradient calculations are consistent across both implementations.
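A self-contained sketch of what such a test could look like, with a big caveat: the real test would run TritonLayerNorm against torch.nn.LayerNorm on a GPU, whereas here a NumPy forward plus finite-difference reference gradients stand in for the two implementations; the point is the pattern of asserting explicitly on the weight and bias gradients.

```python
import numpy as np

def layernorm(x, w, b, eps=1e-5):
    # Reference forward pass (stand-in for either implementation).
    mu = x.mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) * rstd * w + b

def numerical_grad(f, p, eps=1e-6):
    # Central-difference gradient of scalar f() with respect to array p.
    g = np.zeros_like(p)
    for i in range(p.size):
        old = p.flat[i]
        p.flat[i] = old + eps
        hi = f()
        p.flat[i] = old - eps
        lo = f()
        p.flat[i] = old
        g.flat[i] = (hi - lo) / (2 * eps)
    return g

def test_weight_bias_grads_match():
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    w = rng.standard_normal(8)
    b = rng.standard_normal(8)
    dy = rng.standard_normal((4, 8))
    loss = lambda: (layernorm(x, w, b) * dy).sum()
    # Analytic grads, i.e. what the backward kernel should return.
    mu = x.mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(x.var(axis=-1, keepdims=True) + 1e-5)
    x_hat = (x - mu) * rstd
    dw = (dy * x_hat).sum(axis=0)
    db = dy.sum(axis=0)
    assert np.allclose(dw, numerical_grad(loss, w), atol=1e-5), \
        "LayerNorm weight gradients don't match!"
    assert np.allclose(db, numerical_grad(loss, b), atol=1e-5), \
        "LayerNorm bias gradients don't match!"
```

In the real suite the two `allclose` checks would compare `triton_ln.weight.grad` / `triton_ln.bias.grad` against the grads of `torch.nn.LayerNorm` after identical forward/backward passes, parametrized over the same shapes the existing tests use.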

Thanks again for your support.

dame-cell commented 2 weeks ago

I see, ok. I will try to add a test case that explicitly compares the weight and bias gradients between the Triton-based TritonLayerNorm and PyTorch's torch.nn.LayerNorm.

And I'll check my implementation again.

fanzhongyi commented 2 weeks ago

Looking forward to your updates, and thank you very much for your open-source work.