Closed saforem2 closed 4 months ago
Changes:

- `export LR_WARMUP_FRAC="${LR_WARMUP_FRAC:-0.05}"`, which will warm up the learning rate over the first 5% of the total training iterations
- `LR_DECAY_ITERS` during training: `None` if not specified, according to the default from `megatron/arguments.py`
- `ALCF/aws_ofi_nccl_plugin.sh`
- Sunspot: `anl_24_q2_release`; updated `ALCF/sunspot-env.sh` to reflect this change
- Adds fix for `flash-attn` discrepancy

Loss Curves:
![ScreenShot-2024-05-16-115705](https://github.com/argonne-lcf/Megatron-DeepSpeed/assets/5234251/ed1d245b-e1e2-4e63-a18f-e17312f68594)
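As a rough sketch of how the fractional warmup setting above behaves: the `${LR_WARMUP_FRAC:-0.05}` expansion keeps any value already exported by the user and falls back to `0.05` otherwise, and the fraction can then be converted to a warmup iteration count. The `TRAIN_ITERS` and `LR_WARMUP_ITERS` variable names below are illustrative, not taken from the repository's launch scripts:

```shell
# Keep a user-provided value if set; otherwise default to 0.05 (5%).
export LR_WARMUP_FRAC="${LR_WARMUP_FRAC:-0.05}"

# Hypothetical total number of training iterations for this example.
TRAIN_ITERS="${TRAIN_ITERS:-100000}"

# warmup iterations = warmup fraction * total iterations
# (awk handles the floating-point multiply; printf "%d" truncates to an int)
LR_WARMUP_ITERS=$(awk -v f="$LR_WARMUP_FRAC" -v t="$TRAIN_ITERS" \
    'BEGIN { printf "%d", f * t }')

echo "$LR_WARMUP_ITERS"   # 5000 with the defaults above
```

With the defaults this warms up over the first 5,000 of 100,000 iterations; exporting `LR_WARMUP_FRAC=0.1` before launching would double that without editing the script.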