NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

[ERROR] cuBLAS error when launch training with Megatron-LM and TransformerEngine #847

Closed Btlmd closed 1 month ago

Btlmd commented 1 month ago

Hi,

I am using Megatron-LM with TransformerEngine to launch LM training. I encounter the following issue when the data-parallel (dp) world size is not a "round" number such as a power of two, e.g. 30.

```
RuntimeError: TransformerEngine/transformer_engine/common/gemm/cublaslt_gemm.cu:326 in function cublas_gemm: cuBLAS Error: the requested functionality is not supported
```
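For context, #845 attributes this class of failure to operand pointer alignment on the FP8 GEMM path. Below is a back-of-the-envelope sketch of why a dp world size of 30 can produce misaligned shard start addresses while 32 does not. It assumes, purely for illustration (this is not necessarily Megatron-LM's actual sharding logic), that a flat bf16 parameter buffer is split across dp ranks by ceil-division:

```python
# Hypothetical sizes; only the arithmetic matters.
total_numel = 2**27   # 134,217,728 elements in a flat bf16 buffer
elem_size = 2         # bytes per bf16 element

for dp in (32, 30):
    shard_numel = (total_numel + dp - 1) // dp  # ceil-divided per-rank shard
    bad = [r for r in range(dp)
           if (r * shard_numel * elem_size) % 256 != 0]
    print(f"dp={dp}: {len(bad)} of {dp} shard start addresses "
          f"are not 256-byte aligned")
```

Under these assumptions, dp=32 leaves every shard offset on a 256-byte boundary, while dp=30 puts all ranks except rank 0 off the boundary, which would match the failure showing up only at "non-round" dp sizes.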

This error is related to #845. After fixing the alignment issue following #845, the above error is resolved, but we then encountered another error.

The strange thing is that the new error is reported only on some of the nodes, even though all of the addresses there are aligned to 256 bytes. We reproduced this error on nvcr.io/nvidia/nemo:24.03.01.framework and nvcr.io/nvidia/nemo:24.01.01.framework.
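For anyone wanting to reproduce that observation, here is a minimal sketch of how one might dump operand addresses on each node. `report_alignment` is a hypothetical helper; `Tensor.data_ptr()` is standard PyTorch:

```python
import torch

def report_alignment(name: str, t: torch.Tensor, boundary: int = 256) -> None:
    # data_ptr() returns the raw device address of the tensor's first element.
    addr = t.data_ptr()
    print(f"{name}: addr={addr:#x}, addr % {boundary} = {addr % boundary}")

# Example: a view into a flat buffer can be misaligned even when the
# buffer itself is allocated on a 256-byte boundary.
flat = torch.empty(1024, dtype=torch.bfloat16, device="cuda")
report_alignment("flat", flat)           # allocator-aligned base address
report_alignment("shard", flat[3:515])   # starts 6 bytes past the base
```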

I'm not sure where to start further debugging. I would be grateful if anyone could offer some help.

Btlmd commented 1 month ago

Do you have any ideas about this error? @phu0ngng

Btlmd commented 1 month ago

The error is fixed by https://github.com/NVIDIA/Megatron-LM/commit/c3677e09aa4e2eec37048307bd795928b8f8324a