NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] How Do NCCL_ALGO and Flash Attention Affect Deterministic Training in Megatron? #925

Open jinzhuer opened 1 month ago

jinzhuer commented 1 month ago

Issue Description:

I read the documentation on reproducibility, which says that deterministic training requires enabling --deterministic-mode, setting NCCL_ALGO, setting NVTE_ALLOW_NONDETERMINISTIC_ALGO=0, and not using --use-flash-attn.

I tested Megatron on two nodes (TP=2, PP=2) with eight A800 GPUs each, training for 50 iterations. I ran this configuration multiple times and checked whether the saved models were identical across runs, comparing parameters one by one. Setting NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 alone was enough to make the saved model parameters identical across runs; it seems only this setting matters for reproducibility in my tests. Conversely, without this environment variable, each run saved different model parameters.
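For reference, the run-to-run check described above can be sketched as a parameter-by-parameter comparison of two saved state dicts. This is an illustrative sketch, not Megatron code: `state_dicts_identical` is a hypothetical helper, and the dicts stand in for however the checkpoint shards are actually loaded (e.g. `torch.load` on each rank's checkpoint).

```python
def state_dicts_identical(sd_a, sd_b):
    """Return True iff both dicts hold the same parameter names and
    exactly equal values for every parameter (illustrative helper)."""
    if sd_a.keys() != sd_b.keys():
        return False
    for name in sd_a:
        a, b = sd_a[name], sd_b[name]
        # For torch tensors you would use torch.equal(a, b) here for a
        # bitwise-exact check; plain Python values compare with ==.
        equal = a.equal(b) if hasattr(a, "equal") else a == b
        if not equal:
            return False
    return True


# Toy stand-ins for two runs' checkpoints (real checkpoints would be
# loaded per rank and compared shard by shard).
run1 = {"w": [1.0, 2.0], "b": [0.0]}
run2 = {"w": [1.0, 2.0], "b": [0.0]}
print(state_dicts_identical(run1, run2))
```

In a TP=2/PP=2 setup this check would be repeated for every rank's shard, since a mismatch on any one shard means the runs diverged.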

Questions:

  1. Under what conditions do NCCL_ALGO and --use-flash-attn cause non-deterministic training results?
  2. In my environment, NCCL_ALGO defaults to None. In this case, how does NCCL choose the algorithm, and how can I know which algorithm is being selected?

Environment Details:

Thank you for your assistance.

yaox12 commented 1 month ago

Flash Attention has had a deterministic flag since v2.4. For FA >= 2.4, NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 sets this flag automatically; for FA < 2.4, you need to disable Flash Attention yourself to get deterministic results.

As for NCCL: NCCL_ALGO=NVLS is only supported on platforms with NVLink switches. You can set NCCL_DEBUG=INFO to check which algorithm is being selected.
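Putting the reply together, a minimal launch-environment sketch might look like the following. The flag and variable names come from the thread and the Megatron reproducibility docs; the `Ring` choice, the `torchrun` invocation, and `pretrain_gpt.py` are illustrative placeholders, not a prescribed setup.

```shell
# Force deterministic kernels in Transformer Engine; for FA >= 2.4
# this also enables Flash Attention's deterministic flag.
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=0

# Pin one NCCL algorithm (Ring here is illustrative; NVLS is only
# available on platforms with NVLink switches).
export NCCL_ALGO=Ring

# Log which NCCL algorithm/protocol is actually selected at runtime.
export NCCL_DEBUG=INFO

# Illustrative launch; pass --deterministic-mode alongside the
# usual training arguments.
torchrun ... pretrain_gpt.py --deterministic-mode ...
```

With NCCL_DEBUG=INFO, the chosen algorithm appears in the NCCL log lines printed during collective setup, which answers the second question even when NCCL_ALGO is left unset.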