NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

BUG: fused_amax_and_scale_update_after_reduction(): incompatible function arguments. The following argument types are supported: #1275


cassanof commented 1 week ago

I'm currently getting the following error on a simple forward pass through a transformer model when using DelayedScaling:

```
[rank0]:     with te.fp8_autocast(enabled=True, fp8_recipe=self.te_fp8_recipe):
[rank0]:   File "/home/federico/.pyenv/versions/3.11.9/lib/python3.11/contextlib.py", line 144, in __exit__
[rank0]:     next(self.gen)
[rank0]:   File "/mnt/large_shared/federico/env/lib/python3.11/site-packages/transformer_engine/pytorch/fp8.py", line 581, in fp8_autocast
[rank0]:     FP8GlobalStateManager.fp8_autocast_exit(enabled, _graph=_graph)
[rank0]:   File "/mnt/large_shared/federico/env/lib/python3.11/site-packages/transformer_engine/pytorch/fp8.py", line 435, in fp8_autocast_exit
[rank0]:     cls.reduce_and_update_fp8_tensors(forward=True, fp8_weights=False)
[rank0]:   File "/mnt/large_shared/federico/env/lib/python3.11/site-packages/transformer_engine/pytorch/fp8.py", line 365, in reduce_and_update_fp8_tensors
[rank0]:     tex.fused_amax_and_scale_update_after_reduction(
[rank0]: TypeError: fused_amax_and_scale_update_after_reduction(): incompatible function arguments. The following argument types are supported:
[rank0]:     1. (arg0: torch.Tensor, arg1: list[torch.Tensor], arg2: list[torch.Tensor], arg3: list[torch.Tensor], arg4: str, arg5: transformer_engine::DType, arg6: float) -> None

[rank0]: Invoked with: tensor([2.3625e+01, 3.7500e-01, 0.0000e+00, 2.3625e+01, 4.2188e-01, 0.0000e+00,
[rank0]:         3.0000e+00, 3.9648e-01, 0.0000e+00, 9.2578e-01, 3.4570e-01, 0.0000e+00,
[rank0]:         2.7188e+00, 3.6328e-01, 0.0000e+00, 2.7188e+00, 5.4688e-01, 0.0000e+00,
[rank0]:         5.2188e+00, 5.1172e-01, 0.0000e+00, 1.1600e+02, 3.3594e-01, 0.0000e+00,
[rank0]:         1.1600e+02, 9.4922e-01, 0.0000e+00, 2.7656e+00, 3.8477e-01, 0.0000e+00,
[rank0]:         7.3438e-01, 2.9492e-01, 0.0000e+00, 1.6750e+01, 6.2109e-01, 0.0000e+00,
[rank0]:         1.6750e+01, 4.7461e-01, 0.0000e+00, 1.6750e+01, 2.6367e-01, 0.0000e+00,
[rank0]:         2.2188e+00, 5.1953e-01, 0.0000e+00, 4.4000e+01, 3.3203e-01, 0.0000e+00,
[rank0]:         4.4000e+01, 6.6797e-01, 0.0000e+00, 1.8828e+00, 4.1211e-01, 0.0000e+00,
[rank0]:         8.1250e-01, 3.9453e-01, 0.0000e+00, 1.8750e+01, 6.2109e-01, 0.0000e+00,
.... omitted rest, many tensors printed out ....
```

The recipe is quite simple: `te_recipe.DelayedScaling(te_recipe.Format.HYBRID, amax_history_len=64, amax_compute_algo="max")`. If I omit the recipe from the autocast context, the forward pass works as expected.
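
For reference, here is a minimal sketch of the setup (a single `te.Linear` stands in for the actual model, which I can't share; sizes and names are illustrative):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe as te_recipe

# The recipe from above, with fp8_format passed by keyword
# (note that DelayedScaling's first positional parameter is margin,
# not fp8_format).
fp8_recipe = te_recipe.DelayedScaling(
    fp8_format=te_recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
    amax_history_len=64,
    amax_compute_algo="max",
)

# Stand-in for the real transformer model.
model = te.Linear(768, 768).cuda()
x = torch.randn(16, 768, device="cuda")

# The TypeError surfaces when this context exits, during the fused
# amax/scale update; dropping fp8_recipe makes the forward work.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = model(x)
```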

Any ideas?

ksivaman commented 3 days ago

@cassanof Do you have a script that replicates this error? I'm not able to reproduce it with the same recipe. If not, could you give a more detailed stack trace with the argument types passed to `tex.fused_amax_and_scale_update_after_reduction`?
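
If it helps, one (hypothetical) way to capture those types is to shim the extension binding that `fp8.py` calls, before running your model:

```python
import transformer_engine.pytorch.fp8 as fp8_mod

# Wrap the binding that reduce_and_update_fp8_tensors invokes (via the
# tex alias inside fp8.py, per the traceback) and log argument types
# before delegating to the original function.
_orig = fp8_mod.tex.fused_amax_and_scale_update_after_reduction

def _logged(*args):
    print([type(a).__name__ for a in args])
    return _orig(*args)

fp8_mod.tex.fused_amax_and_scale_update_after_reduction = _logged
```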

cassanof commented 2 days ago

Hi! Unfortunately I can't share the script, and I wasn't able to reproduce the error with some of the open models. The arguments are a long list of different tensors.

In the end, I was able to get amax scaling to work by completely disabling the fused kernel in your code and using the non-fused path instead. This is obviously undesirable, though.
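
For reference, this is roughly what the unfused path computes per step (a pure-PyTorch sketch; the power-of-two scale formula is the standard delayed-scaling rule and is assumed here rather than copied from the library source):

```python
import torch

def unfused_amax_and_scale_update(amax_history, scale, fp8_max, margin=0.0):
    """One delayed-scaling step for a [history_len, n] amax history.

    fp8_max is the format's max representable value (e.g. 448.0 for E4M3).
    """
    # Reduce the history per tensor (amax_compute_algo="max").
    amax = amax_history.max(dim=0).values
    # Largest power of two such that amax * scale stays within fp8_max.
    exp = torch.floor(torch.log2(fp8_max / amax)) - margin
    new_scale = torch.pow(2.0, exp)
    # Keep the previous scale wherever amax is zero or non-finite.
    valid = torch.isfinite(amax) & (amax > 0)
    new_scale = torch.where(valid, new_scale, scale)
    # Roll the history and clear a slot for the next iteration's amaxes.
    amax_history = torch.roll(amax_history, shifts=-1, dims=0)
    amax_history[-1].zero_()
    return amax_history, new_scale
```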