NVIDIA / Megatron-LM
Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start
10.13k stars · 2.28k forks
Issues
#1002 · [QUESTION] Will the data get re-shuffled if the sequence length is modified during training? · SefaZeng · closed 1 month ago · 0 comments
#1001 · [QUESTION] Why is the computation operator slower when computation overlaps with communication? · yu-depend · closed 1 month ago · 0 comments
#1000 · [QUESTION] Training Llama3 70B on 16 x A100 achieves only a low throughput of 20 TFLOPS · ZeroAGI · closed 1 month ago · 1 comment
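A quick way to put #1000's number in perspective is a model FLOPs utilization (MFU) check. The sketch below assumes "20 TFLOPS" is the achieved per-GPU throughput and uses the A100's 312 TFLOPS BF16 peak; both assumptions are mine, not the issue's.

```python
# Rough MFU estimate for the figures reported in #1000, assuming
# "20 TFLOPS" is per-GPU achieved throughput and 312 TFLOPS is the
# A100 BF16 peak.
achieved_tflops = 20.0
a100_bf16_peak_tflops = 312.0
mfu = achieved_tflops / a100_bf16_peak_tflops
print(f"MFU = {mfu:.1%}")  # ~6.4%, well below the 40-50% commonly reported
```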
#999 · Why is gather_output not supported in ColumnParallelLinear when using sequence parallelism? · mushan09 · opened 1 month ago · 0 comments
#998 · [BUG] LLaVA pipeline parallel initialization problem · KookHoiKim · opened 1 month ago · 1 comment
#997 · [QUESTION] Why and when does matmul call different kernels? · hxdtest · closed 1 month ago · 1 comment
#996 · Fix _te_version issue in transformer_engine.py get_cpu_offload_context() · 1195343015 · closed 3 weeks ago · 4 comments
#995 · Fix FLOPs calculation · janEbert · closed 1 month ago · 1 comment
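For context on what a FLOPs calculation like the one touched by #995 estimates, the common back-of-envelope rule (an approximation, not necessarily the repo's exact formula) counts about 6 FLOPs per parameter per training token: roughly 2 in the forward pass and 4 in the backward pass.

```python
# Back-of-envelope training FLOPs via the common 6*N*D rule of thumb;
# illustrative only, not the exact calculation fixed in #995.
def approx_train_flops(num_params: float, num_tokens: float) -> float:
    # ~2 FLOPs/param/token forward + ~4 FLOPs/param/token backward
    return 6.0 * num_params * num_tokens

# Example: a 70B-parameter model trained on 1T tokens.
print(f"{approx_train_flops(70e9, 1e12):.2e}")  # 4.20e+23 FLOPs
```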
#994 · [QUESTION] How to freeze specific modules while training? · wavy-jung · closed 1 month ago · 3 comments
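The usual PyTorch-level answer to #994's question, sketched generically (the issue thread may describe a Megatron-specific path), is to disable gradients on the target parameters and keep them out of the optimizer:

```python
import torch

# Generic module-freezing sketch, not Megatron-LM-specific: disable
# gradients on the frozen block and build the optimizer only over
# parameters that still require gradients.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),  # pretend this is the block to freeze
    torch.nn.Linear(16, 4),
)
for p in model[0].parameters():
    p.requires_grad = False  # frozen: no gradient, no update

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```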
#993 · PHASE 6 LILITI STK 3.6.9 ANTI-CARBON ARTIFICIAL INTELLIGENCE. · felipeliliti · opened 1 month ago · 0 comments
#992 · [BUG] Error raised while converting LLM to Megatron · KookHoiKim · opened 1 month ago · 0 comments
#991 · [BUG] CLIP key mismatch · KookHoiKim · opened 1 month ago · 1 comment
#990 · No pre-norm for non-MoE GPT-style model when using TE transformer layer spec? · hityupeng · opened 1 month ago · 2 comments
#989 · [QUESTION] Does it support Knowledge Distillation? · mushan09 · closed 1 month ago · 1 comment
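As background for #989, a standard knowledge-distillation loss (generic PyTorch, not a claim about Megatron-LM's API) softens teacher and student logits with a temperature and penalizes their KL divergence:

```python
import torch.nn.functional as F

# Standard Hinton-style distillation loss sketch; `temperature` and the
# t**2 scaling follow the common recipe, not Megatron-LM code.
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    # t**2 keeps gradient magnitudes roughly constant across temperatures
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)
```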
#988 · [BUG] Arguments of get_cpu_offload_context() in transformer_engine.py for different versions of TE · 1195343015 · closed 3 weeks ago · 4 comments
#987 · Add Hopper llama golden with mcore calling stack · yiakwy-xpu-ml-framework-team · opened 1 month ago · 5 comments
#986 · [Bugfix] Fix typo in MoE doc · billishyahao · opened 1 month ago · 1 comment
#985 · [QUESTION] GLU activation with tensor parallelism in GroupedMLP · Teng-xu · closed 3 weeks ago · 6 comments
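To ground #985's question: a GLU-style MLP projects the input twice and gates one projection with the activation of the other; under tensor parallelism both projections must be partitioned identically so the elementwise product stays local to each rank. A SwiGLU-flavored single-rank sketch (illustrative, not GroupedMLP code):

```python
import torch
import torch.nn.functional as F

# SwiGLU-style gated activation on one rank. With tensor parallelism,
# w_gate and w_up would be column-partitioned the same way so the
# elementwise product needs no communication.
x = torch.randn(4, 16)        # [tokens, hidden]
w_gate = torch.randn(16, 32)  # gate projection
w_up = torch.randn(16, 32)    # up projection
out = F.silu(x @ w_gate) * (x @ w_up)  # [tokens, 32]
```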
#975 · [QUESTION] Splitting large documents and bucketing · shafiqabedin · closed 3 weeks ago · 0 comments
#973 · [bugfix] Fix _warmup_jit_function · taowangcheng · opened 1 month ago · 2 comments
#972 · [bugfix] Fix the incorrect with-statement · aaa123git · opened 1 month ago · 0 comments
#971 · [QUESTION] Megatron-LM `DistributedOptimizer` or NeMo `MegatronDistributedFusedAdam` Optimizer? · TJ-Solergibert · closed 3 weeks ago · 0 comments
#970 · [QUESTION] Checkpoint storage format · syx11237744 · closed 3 weeks ago · 0 comments
#969 · [QUESTION] · suzewei · closed 3 weeks ago · 0 comments
#968 · nothing · wangwz6666 · closed 1 month ago · 0 comments
#967 · BitPipe_initial_version · wuhouming · closed 1 month ago · 0 comments
#966 · [QUESTION] How to convert a Hugging Face checkpoint and also use PP > 1 or TP > 1 · sambar1729 · closed 3 weeks ago · 0 comments
#965 · [QUESTION] About memory usage in dot_product_attention.py · sambar1729 · closed 1 month ago · 1 comment
#964 · [QUESTION] Asynchronous Checkpoint Saving · zhaoyang-star · closed 3 weeks ago · 11 comments
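The pattern behind #964's topic, sketched generically (this is not the Megatron-LM implementation): snapshot the state to CPU synchronously so training cannot race the copy, then let a background thread do the slow disk write.

```python
import threading
import torch

# Minimal async-checkpoint sketch. The CPU copy happens on the caller's
# thread; only the torch.save runs in the background. Join the returned
# thread before overwriting the same file or at shutdown.
def save_async(model: torch.nn.Module, path: str) -> threading.Thread:
    cpu_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    writer = threading.Thread(target=torch.save, args=(cpu_state, path))
    writer.start()
    return writer
```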
#963 · Learning rate error when continuing training · TtCWH · closed 1 month ago · 2 comments
#961 · Update README.md · ArtificialZeng · opened 2 months ago · 0 comments
#960 · Fix typo in token_dispatcher.py · xinqiu · opened 2 months ago · 1 comment
#959 · transformer_engine import error · yuvraj27khanna02 · opened 2 months ago · 2 comments
#958 · (Pre-training Mamba with train.sh) Error: GPT2BPETokenizer: assert args.vocab_file is not None · SkanderBS2024 · opened 2 months ago · 6 comments
#957 · [BUG] Infinite Loop in `_get_num_epochs` Function of `GPTDataset` Class When `num_tokens_per_epoch` is Zero · Dune-Z · opened 2 months ago · 1 comment
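A plausible reconstruction of the failure mode in #957 (illustrative; the function name and signature below are hypothetical, not the actual GPTDataset code): if each epoch contributes zero tokens, a loop that accumulates tokens toward a target never makes progress, so the fix is a guard on the zero case.

```python
# Hypothetical reconstruction of the #957 hang: with num_tokens_per_epoch
# == 0 the while-loop below never terminates. A guard avoids the hang.
def get_num_epochs(num_tokens_per_epoch: int, num_tokens_required: int) -> int:
    if num_tokens_per_epoch <= 0:
        raise ValueError("num_tokens_per_epoch must be positive")
    num_epochs, num_tokens = 1, num_tokens_per_epoch
    while num_tokens < num_tokens_required:
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
    return num_epochs
```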
#956 · Fix llama3 checkpoint converter · alex-ht · closed 2 months ago · 0 comments
#955 · [BUG] MoE router top-k algorithm is different from the Hugging Face implementation · Au3C2 · closed 1 month ago · 1 comment
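One common source of the kind of mismatch reported in #955, shown as a generic illustration (neither variant is claimed to be the exact Megatron-LM or Hugging Face code), is the order of softmax and top-k in the router: the selected experts match either way, but the gate weights differ.

```python
import torch

logits = torch.randn(4, 8)  # [tokens, experts]
k = 2

# Variant A: softmax over all experts, then pick top-k.
# Gate weights are un-renormalized probabilities (sum < 1 per token).
weights_a, experts_a = logits.softmax(dim=-1).topk(k, dim=-1)

# Variant B: pick top-k logits, then softmax over just those k.
# Gate weights sum to exactly 1 per token, so they differ from A.
top_logits, experts_b = logits.topk(k, dim=-1)
weights_b = top_logits.softmax(dim=-1)
```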
#954 · [QUESTION] Why is `reset_attention_mask=False` by default? · dtamayo-nlp · closed 3 weeks ago · 0 comments
#953 · [QUESTION] One possible typo in docs/source/distrib_optimizer.md · wplf · closed 3 weeks ago · 0 comments
#952 · [BUG] Error pre-training BERT · fabiancpl · opened 2 months ago · 2 comments
#951 · Different Tokenizer · dustinwloring1988 · closed 3 weeks ago · 0 comments
#950 · [BUG] Bug when using --use-mcore-models and --overlap-param-gather · Kingsleyandher · opened 2 months ago · 2 comments
#949 · [BUG] `examples/multimodal/combine_mistral_clip.sh` vision model file mismatch · Baibaifan · opened 2 months ago · 1 comment
#948 · [bugfix] Fixed combine_mistral_clip.sh · Baibaifan · closed 6 days ago · 1 comment
#946 · [QUESTION] About Optimizer & Params Offload · shh2000 · closed 2 months ago · 1 comment
#945 · ERROR: Could not find a version that satisfies the requirement triton==2.1.0 (from versions: none) "MAMBA" · SkanderBS2024 · opened 2 months ago · 4 comments
#944 · Distributed Mamba Training · SkanderBS2024 · opened 2 months ago · 7 comments
#943 · [BUG] Spelling mistake · G-keng · closed 2 months ago · 1 comment
#942 · [BUG] RuntimeError: CUDA error: device-side assert triggered · wccccp · closed 2 months ago · 1 comment
#941 · [DOC] Fix wrong llama2 pretraining URL in README · lausannel · opened 2 months ago · 1 comment