sirutBuasai opened 1 month ago
It seems that our numerical tolerances are too tight for TF32 compute. It's telling that only the FP32 tests are failing and that the errors in test_layernorm_linear_accuracy are near the machine epsilon of TF32 (5e-4). Also, TE configures GEMMs with FP32 data to perform TF32 compute:
https://github.com/NVIDIA/TransformerEngine/blob/3b89c36f0e7427199e4e87076b8a5e1545d70346/transformer_engine/common/gemm/cublaslt_gemm.cu#L115-L117
I think we didn't notice this before because NVIDIA PyTorch containers enable TF32 by default, so we were using the same cuBLAS kernels in both the TE module and the PyTorch reference. However, vanilla PyTorch disables TF32 by default.
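To see the scale of the discrepancy directly, here is a minimal sketch (assuming an Ampere-or-newer GPU, where TF32 is available, and PyTorch >= 1.12) that compares an FP32 matmul against a float64 reference with TF32 off and then on:

```python
# Minimal sketch: measure FP32 matmul error vs. a float64 reference,
# with TF32 disabled and enabled via PyTorch's global toggle.
import torch

torch.manual_seed(0)
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
ref = (a.double() @ b.double()).float()  # high-precision reference

for allow_tf32 in (False, True):
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    out = a @ b
    err = ((out - ref).abs().max() / ref.abs().max()).item()
    # Expect roughly ~1e-6 with TF32 off and ~1e-4 to 1e-3 with TF32 on,
    # consistent with TF32's ~5e-4 machine epsilon.
    print(f"allow_tf32={allow_tf32}: max relative error {err:.2e}")
```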
We have recently done some work to make our numerical testing more robust. In particular, https://github.com/NVIDIA/TransformerEngine/pull/1229 reduces the size of the test cases. We should follow up by tweaking the tolerances based on the data and compute dtypes.
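One possible shape for that follow-up is to choose tolerances from the effective compute dtype rather than the data dtype. The helper below is purely illustrative (the `dtype_tolerances` name and the specific rtol/atol values are hypothetical, not TE's actual API):

```python
# Hypothetical sketch: pick test tolerances based on the effective compute
# precision. Names and values are illustrative, not TE's actual test code.
import torch

def dtype_tolerances(dtype: torch.dtype, tf32: bool = False) -> dict:
    """Return rtol/atol for torch.testing.assert_close based on compute precision."""
    if dtype == torch.float32 and tf32:
        # TF32 keeps only ~10 mantissa bits, so loosen the FP32 tolerances.
        return {"rtol": 5e-3, "atol": 5e-4}
    return {
        torch.float64:  {"rtol": 1e-7,   "atol": 1e-8},
        torch.float32:  {"rtol": 1e-5,   "atol": 1e-6},
        torch.bfloat16: {"rtol": 1.6e-2, "atol": 1e-3},
        torch.float16:  {"rtol": 1e-3,   "atol": 1e-4},
    }[dtype]

# Usage in a test, assuming te_out and ref_out are the tensors under comparison:
# torch.testing.assert_close(te_out, ref_out,
#                            **dtype_tolerances(torch.float32, tf32=True))
```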
Related: https://github.com/NVIDIA/TransformerEngine/issues/494
Hi,
I recently observed the following sanity test error when running with PyTorch 2.4.0 + CUDA 12.4 + cuDNN 9.1.0.
This is running on a single AWS p4d.24xlarge instance with A100 GPUs within a Docker container. The test is run using
TE is installed through
Installed libraries: