NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] How Do NCCL_ALGO and Flash Attention Affect Deterministic Training in Megatron? #925

Open jinzhuer opened 1 month ago

jinzhuer commented 1 month ago

Issue Description:

I read the documentation on reproducibility, which says that deterministic training requires enabling --deterministic-mode, setting NCCL_ALGO, setting NVTE_ALLOW_NONDETERMINISTIC_ALGO=0, and not using --use-flash-attn.

I tested Megatron on two nodes (TP=2, PP=2) with eight A800 GPUs each, training for 50 iterations. I ran this configuration multiple times and checked whether the saved models were identical across runs, comparing parameters one by one. Setting NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 alone was enough to make the saved model parameters identical across runs; it seems only this setting matters for reproducibility in my tests. Conversely, without this environment variable, each run saved different model parameters.
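For reference, the run-to-run check described above can be sketched as a parameter-by-parameter comparison of two saved state dicts. This is an illustrative sketch, not Megatron code: `state_dicts_identical` is a hypothetical helper, and the dicts stand in for however the checkpoint shards are actually loaded (e.g. `torch.load` on each rank's checkpoint).

```python
def state_dicts_identical(sd_a, sd_b):
    """Return True iff both dicts hold the same parameter names and
    exactly equal values for every parameter (illustrative helper)."""
    if sd_a.keys() != sd_b.keys():
        return False
    for name in sd_a:
        a, b = sd_a[name], sd_b[name]
        # For torch tensors you would use torch.equal(a, b) here for a
        # bitwise-exact check; plain Python values compare with ==.
        equal = a.equal(b) if hasattr(a, "equal") else a == b
        if not equal:
            return False
    return True


# Toy stand-ins for two runs' checkpoints (real checkpoints would be
# loaded per rank and compared shard by shard).
run1 = {"w": [1.0, 2.0], "b": [0.0]}
run2 = {"w": [1.0, 2.0], "b": [0.0]}
print(state_dicts_identical(run1, run2))
```

In a TP=2/PP=2 setup this check would be repeated for every rank's shard, since a mismatch on any one shard means the runs diverged.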

Questions:

  1. Under what conditions do NCCL_ALGO and --use-flash-attn cause non-deterministic training results?
  2. In my environment, NCCL_ALGO defaults to None. In this case, how does NCCL choose the algorithm, and how can I know which algorithm is being selected?

Environment Details:

Thank you for your assistance.

yaox12 commented 1 month ago

Flash Attention has had a deterministic flag since v2.4. For FA >= 2.4, NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 sets this flag automatically; for FA < 2.4, you need to disable Flash Attention yourself to get deterministic results.

As for NCCL: NCCL_ALGO=NVLS is only supported on platforms with NVLink switches. You can set NCCL_DEBUG=INFO to check which algorithm is being selected.
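Putting the reply together, a minimal launch-environment sketch might look like the following. The flag and variable names come from the thread and the Megatron reproducibility docs; the `Ring` choice, the `torchrun` invocation, and `pretrain_gpt.py` are illustrative placeholders, not a prescribed setup.

```shell
# Force deterministic kernels in Transformer Engine; for FA >= 2.4
# this also enables Flash Attention's deterministic flag.
export NVTE_ALLOW_NONDETERMINISTIC_ALGO=0

# Pin one NCCL algorithm (Ring here is illustrative; NVLS is only
# available on platforms with NVLink switches).
export NCCL_ALGO=Ring

# Log which NCCL algorithm/protocol is actually selected at runtime.
export NCCL_DEBUG=INFO

# Illustrative launch; pass --deterministic-mode alongside the
# usual training arguments.
torchrun ... pretrain_gpt.py --deterministic-mode ...
```

With NCCL_DEBUG=INFO, the chosen algorithm appears in the NCCL log lines printed during collective setup, which answers the second question even when NCCL_ALGO is left unset.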