NVIDIA / Megatron-LM
Ongoing research training transformer models at scale
Documentation: https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start
10.62k stars · 2.38k forks
Issues (sorted by newest)
[BUG] validate_yaml() isn't in sync with arguments check (#1297) · by pierric · opened 1 day ago · 0 comments
[Update] Print training log on rank 0 (#1296) · by shijungg · opened 1 day ago · 0 comments
[QUESTION] deepseek v2 compatibility? (#1295) · by wavy-jung · opened 1 day ago · 0 comments
[BUG] LLaVA may fail with EPP0 PP>1 (#1293) · by lostkevin · closed 15 hours ago · 0 comments
[BUG] 0.9.0 release version gets param_gather_handle error with 3D parallelism (#1292) · by SeunghyunSEO · opened 3 days ago · 3 comments
[QUESTION] How to convert a torch_dist format checkpoint to torch format? (#1291) · by zhangyilalala · opened 3 days ago · 0 comments
Support qwen2 hf<->mcore ckpt converter (#1290) · by wenyujin333 · opened 3 days ago · 0 comments
Fix: misnamed sharded instead of common in checkpoint (#1289) · by prrathi · opened 5 days ago · 0 comments
Hakiymaz/deepseekv2 enablement (#1288) · by hakankiymaz-amd · closed 1 week ago · 0 comments
[QUESTION] SGD support in distrib_optimizer.py (#1287) · by zstreeter · opened 1 week ago · 0 comments
Fix: Resolve multimodal model errors and update README usage instructions (#1286) · by singleheart · opened 1 week ago · 0 comments
Set `torch.multiprocessing` start method to 'spawn' (#1285) · by hxdtest · opened 1 week ago · 0 comments
Fix a bug in the optimizer's mix_lr/max_lr when args.override_opt_param_scheduler==True (#1284) · by lyuwen · opened 1 week ago · 0 comments
[QUESTION] The optimizer state already holds a 32-bit copy of the model parameters. Why do we need to store a separate copy of the model parameters in the checkpoint? (#1283) · by leondada · opened 1 week ago · 0 comments
CI pipeline MI300 (#1282) · by gurpreet-dhami · closed 1 week ago · 0 comments
Where can I download the tokenizer for the model mcore-llava-mistral-7b-instruct-clip336-pretraining? (#1281) · by herolxl · opened 1 week ago · 0 comments
[BUG] Megatron-LM doesn't support transformer-engine 1.13 (#1280) · by klhhhhh · opened 1 week ago · 1 comment
[BUG] Encountering NaN gradients when using CUDA Graph (#1279) · by DXZDXZ · opened 1 week ago · 1 comment
Distributed checkpoint save fix (#1278) · by zstreet87 · closed 1 week ago · 0 comments
[QUESTION] Is there any restriction on using allgather with moe_expert_capacity_factor? (#1277) · by Louis-J · opened 2 weeks ago · 0 comments
[QUESTION] Scaling MFU calculation (#1276) · by ltm920716 · opened 2 weeks ago · 0 comments
[BUG] TP-comm-overlap bug when replacing `TELayerNormColumnParallelLinear` with `TEColumnParallelLinear` (#1275) · by wplf · opened 2 weeks ago · 0 comments
[BUG] Training crashes when --tp-comm-overlap is set (#1274) · by ltm920716 · closed 4 days ago · 12 comments
Huvu/update t5 attentionmasktype (#1273) · by huvunvidia · opened 2 weeks ago · 0 comments
[QUESTION] How to visualize the computational graph (#1272) · by zixianwang2022 · opened 2 weeks ago · 0 comments
Update t5_model.py (#1271) · by huvunvidia · opened 2 weeks ago · 0 comments
[ENHANCEMENT] Add z-loss (#1270) · by wdevazelhes · closed 1 week ago · 1 comment
[BUG] The `cached_loss_mask` may be modified unexpectedly in GPTDataset? (#1269) · by shmily326 · opened 3 weeks ago · 0 comments
Enable HuggingFace tokenizer (#1268) · by msiddaiah · opened 3 weeks ago · 0 comments
[BUG] Problem building the multimodal Dockerfile (#1267) · by FortuneBush · opened 3 weeks ago · 0 comments
[QUESTION] How to use loader_mcore, and why does it require torch distributed? (#1266) · by KookHoiKim · opened 3 weeks ago · 1 comment
fix: remove unnecessary trailing comma in statement (#1265) · by singleheart · opened 3 weeks ago · 0 comments
Jinda/legal review (#1264) · by jindajia · closed 3 weeks ago · 0 comments
[ENHANCEMENT] Enabling LR scaling for a specific layer (e.g. down-projection...) during pretraining (#1263) · by dhia680 · opened 3 weeks ago · 0 comments
Enabling LR scaling for a specific layer (e.g. down-projection...) during pretraining (#1262) · by dhia680 · opened 3 weeks ago · 3 comments
[ENHANCEMENT] Add support for Apex RMSNorm for use in qk-norm (#1261) · by wdevazelhes · opened 3 weeks ago · 0 comments
Add support for processing gzip files (#1260) · by puneeshkhanna · opened 3 weeks ago · 0 comments
[BUG] Flash attention cannot be enabled with the --use-flash-attn flag when the --use-mcore-models flag is also passed (#1259) · by efsotr · opened 3 weeks ago · 1 comment
[BUG] MoE pre-training does not scale beyond DP dim > 8 (#1258) · by hwang595 · opened 4 weeks ago · 0 comments
[QUESTION] NVIDIA Megatron Core 0.9.0 does not have shared_experts.py (#1257) · by clarence-lee-sheng · closed 1 week ago · 3 comments
[QUESTION] Effect of sequence parallelism with the dropout RNG context (#1256) · by sbmaruf · closed 3 weeks ago · 2 comments
[QUESTION] Transformer Engine is totally a shit. (#1239) · by ZihaoZheng98 · closed 1 month ago · 0 comments
[QUESTION] Does TP overlap support thd, whose sequence length is flexible? (#1238) · by wplf · closed 4 weeks ago · 0 comments
[QUESTION] Using FP8 causes OOM, while --bf16 works well (#1237) · by yanchenmochen · closed 4 weeks ago · 0 comments
Is Megatron FP8 training compatible with recompute? (#1236) · by yanchenmochen · closed 4 weeks ago · 0 comments
Add fallbacks for C++ extension + jit_fuser (#1235) · by marcromeyn · closed 1 month ago · 1 comment
[BUG] Cannot save Mamba model in distributed training (#1234) · by siriusctrl · opened 1 month ago · 2 comments
Make it an option to use the TransformerEngine activation function in the FFN block (#1233) · by guyueh1 · opened 1 month ago · 0 comments
[QUESTION] How can a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism) be loaded in a different parallel configuration? (#1232) · by polisettyvarma · closed 4 weeks ago · 1 comment
[QUESTION] How to incorporate MoE into hybrid Mamba efficiently (#1231) · by sunying2018 · closed 4 weeks ago · 0 comments