NVIDIA / Megatron-LM
Ongoing research training transformer models at scale
Documentation: https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start
10.62k stars · 2.38k forks
Issues (sorted by newest)
[BUG] validate_yaml() isn't in sync with arguments check (#1297) · by pierric · opened 1 day ago · 0 comments
[Update] Print training log on rank 0 (#1296) · by shijungg · opened 1 day ago · 0 comments
[QUESTION] deepseek v2 compatibility? (#1295) · by wavy-jung · opened 1 day ago · 0 comments
[BUG] LLaVA may fail with EPP0 PP>1 (#1293) · by lostkevin · closed 15 hours ago · 0 comments
[BUG] 0.9.0 release version gets param_gather_handle error with 3D parallelism (#1292) · by SeunghyunSEO · opened 3 days ago · 3 comments
[QUESTION] How to convert a torch_dist format checkpoint to torch format? (#1291) · by zhangyilalala · opened 3 days ago · 0 comments
Support qwen2 hf<->mcore ckpt converter (#1290) · by wenyujin333 · opened 3 days ago · 0 comments
Fix: misnamed sharded instead of common in checkpoint (#1289) · by prrathi · opened 5 days ago · 0 comments
Hakiymaz/deepseekv2 enablement (#1288) · by hakankiymaz-amd · closed 1 week ago · 0 comments
[QUESTION] SGD support in distrib_optimizer.py (#1287) · by zstreeter · opened 1 week ago · 0 comments
Fix: Resolve multimodal model errors and update README usage instructions (#1286) · by singleheart · opened 1 week ago · 0 comments
Set `torch.multiprocessing` start method to 'spawn' (#1285) · by hxdtest · opened 1 week ago · 0 comments
Fix a bug in the optimizer's mix_lr/max_lr when args.override_opt_param_scheduler==True (#1284) · by lyuwen · opened 1 week ago · 0 comments
[QUESTION] The optimizer state already holds a 32-bit copy of the model parameters. Why do we need to store a separate copy of the model parameters in the checkpoint? (#1283) · by leondada · opened 1 week ago · 0 comments
CI pipeline MI300 (#1282) · by gurpreet-dhami · closed 1 week ago · 0 comments
Where can I download the tokenizer for the model mcore-llava-mistral-7b-instruct-clip336-pretraining? (#1281) · by herolxl · opened 1 week ago · 0 comments
[BUG] Megatron-LM doesn't support transformer-engine 1.13 (#1280) · by klhhhhh · opened 1 week ago · 1 comment
[BUG] Encountering NaN gradients when using CUDA Graph (#1279) · by DXZDXZ · opened 1 week ago · 1 comment
Distributed checkpoint save fix (#1278) · by zstreet87 · closed 1 week ago · 0 comments
[QUESTION] Is there any restriction on using allgather with moe_expert_capacity_factor? (#1277) · by Louis-J · opened 2 weeks ago · 0 comments
[QUESTION] Scaling MFU calculation (#1276) · by ltm920716 · opened 2 weeks ago · 0 comments
[BUG] TP-comm-overlap bug when replacing `TELayerNormColumnParallelLinear` with `TEColumnParallelLinear` (#1275) · by wplf · opened 2 weeks ago · 0 comments
[BUG] Training crashes when --tp-comm-overlap is set (#1274) · by ltm920716 · closed 4 days ago · 12 comments
Huvu/update t5 attentionmasktype (#1273) · by huvunvidia · opened 2 weeks ago · 0 comments
[QUESTION] How to visualize the computational graph (#1272) · by zixianwang2022 · opened 2 weeks ago · 0 comments
Update t5_model.py (#1271) · by huvunvidia · opened 2 weeks ago · 0 comments
[ENHANCEMENT] Add z-loss (#1270) · by wdevazelhes · closed 1 week ago · 1 comment
[BUG] The `cached_loss_mask` may be modified unexpectedly in GPTDataset? (#1269) · by shmily326 · opened 3 weeks ago · 0 comments
Enable HuggingFace tokenizer (#1268) · by msiddaiah · opened 3 weeks ago · 0 comments
[BUG] Problem building the multimodal Dockerfile (#1267) · by FortuneBush · opened 3 weeks ago · 0 comments
[QUESTION] How to use loader_mcore, and why does it require torch distributed? (#1266) · by KookHoiKim · opened 3 weeks ago · 1 comment
fix: remove unnecessary trailing comma in statement (#1265) · by singleheart · opened 3 weeks ago · 0 comments
Jinda/legal review (#1264) · by jindajia · closed 3 weeks ago · 0 comments
[ENHANCEMENT] Enabling LR scaling for a specific layer (e.g. down-projection...) during pretraining (#1263) · by dhia680 · opened 3 weeks ago · 0 comments
Enabling LR scaling for a specific layer (e.g. down-projection...) during pretraining (#1262) · by dhia680 · opened 3 weeks ago · 3 comments
[ENHANCEMENT] Add support for Apex RMSNorm for use in qk-norm (#1261) · by wdevazelhes · opened 3 weeks ago · 0 comments
Add support for processing gzip files (#1260) · by puneeshkhanna · opened 3 weeks ago · 0 comments
[BUG] Flash attention cannot be enabled with the --use-flash-attn flag when the --use-mcore-models flag is also passed (#1259) · by efsotr · opened 3 weeks ago · 1 comment
[BUG] MoE pre-training does not scale beyond DP dim > 8 (#1258) · by hwang595 · opened 4 weeks ago · 0 comments
[QUESTION] NVIDIA Megatron Core 0.9.0 does not have shared_experts.py (#1257) · by clarence-lee-sheng · closed 1 week ago · 3 comments
[QUESTION] Effect of sequence parallelism with the dropout RNG context (#1256) · by sbmaruf · closed 3 weeks ago · 2 comments
[QUESTION] Transformer Engine is totally a shit. (#1239) · by ZihaoZheng98 · closed 1 month ago · 0 comments
[QUESTION] Does TP overlap support thd, whose sequence length is flexible? (#1238) · by wplf · closed 4 weeks ago · 0 comments
[QUESTION] Using FP8 causes OOM, while --bf16 works well (#1237) · by yanchenmochen · closed 4 weeks ago · 0 comments
Is Megatron FP8 training compatible with recompute? (#1236) · by yanchenmochen · closed 4 weeks ago · 0 comments
Add fallbacks for C++ extension + jit_fuser (#1235) · by marcromeyn · closed 1 month ago · 1 comment
[BUG] Cannot save Mamba model in distributed training (#1234) · by siriusctrl · opened 1 month ago · 2 comments
Make it an option to use the TransformerEngine activation function in the FFN block (#1233) · by guyueh1 · opened 1 month ago · 0 comments
[QUESTION] How can a checkpoint saved in one parallel configuration (tensor/pipeline/data parallelism) be loaded in a different parallel configuration? (#1232) · by polisettyvarma · closed 4 weeks ago · 1 comment
[QUESTION] How to incorporate MoE into hybrid Mamba efficiently (#1231) · by sunying2018 · closed 4 weeks ago · 0 comments