NVIDIA / Megatron-LM
Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start
10.13k stars · 2.28k forks
Issues
#1002 · [QUESTION] Will the data get re-shuffled if the sequence length is modified during training? · SefaZeng · closed 1 month ago · 0 comments
#1001 · [QUESTION] Why is the computation operator slower when computation overlaps with communication? · yu-depend · closed 1 month ago · 0 comments
#1000 · [QUESTION] Training Llama3 70B on 16 x A100 achieves only a low throughput of 20 TFLOPS · ZeroAGI · closed 1 month ago · 1 comment
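A quick way to put #1000's number in perspective is a model FLOPs utilization (MFU) check. The sketch below assumes "20 TFLOPS" is the achieved per-GPU throughput and uses the A100's 312 TFLOPS BF16 peak; both assumptions are mine, not the issue's.

```python
# Rough MFU estimate for the figures reported in #1000, assuming
# "20 TFLOPS" is per-GPU achieved throughput and 312 TFLOPS is the
# A100 BF16 peak.
achieved_tflops = 20.0
a100_bf16_peak_tflops = 312.0
mfu = achieved_tflops / a100_bf16_peak_tflops
print(f"MFU = {mfu:.1%}")  # ~6.4%, well below the 40-50% commonly reported
```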
#999 · Why is gather_output not supported in ColumnParallelLinear when using sequence parallelism? · mushan09 · opened 1 month ago · 0 comments
#998 · [BUG] LLaVA pipeline parallel initialization problem · KookHoiKim · opened 1 month ago · 1 comment
#997 · [QUESTION] Why and when does matmul call different kernels? · hxdtest · closed 1 month ago · 1 comment
#996 · Fix _te_version issue in transformer_engine.py get_cpu_offload_context() · 1195343015 · closed 3 weeks ago · 4 comments
#995 · Fix FLOPs calculation · janEbert · closed 1 month ago · 1 comment
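For context on what a FLOPs calculation like the one touched by #995 estimates, the common back-of-envelope rule (an approximation, not necessarily the repo's exact formula) counts about 6 FLOPs per parameter per training token: roughly 2 in the forward pass and 4 in the backward pass.

```python
# Back-of-envelope training FLOPs via the common 6*N*D rule of thumb;
# illustrative only, not the exact calculation fixed in #995.
def approx_train_flops(num_params: float, num_tokens: float) -> float:
    # ~2 FLOPs/param/token forward + ~4 FLOPs/param/token backward
    return 6.0 * num_params * num_tokens

# Example: a 70B-parameter model trained on 1T tokens.
print(f"{approx_train_flops(70e9, 1e12):.2e}")  # 4.20e+23 FLOPs
```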
#994 · [QUESTION] How to freeze specific modules while training? · wavy-jung · closed 1 month ago · 3 comments
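The usual PyTorch-level answer to #994's question, sketched generically (the issue thread may describe a Megatron-specific path), is to disable gradients on the target parameters and keep them out of the optimizer:

```python
import torch

# Generic module-freezing sketch, not Megatron-LM-specific: disable
# gradients on the frozen block and build the optimizer only over
# parameters that still require gradients.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 16),  # pretend this is the block to freeze
    torch.nn.Linear(16, 4),
)
for p in model[0].parameters():
    p.requires_grad = False  # frozen: no gradient, no update

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```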
#993 · PHASE 6 LILITI STK 3.6.9 ANTI-CARBON ARTIFICIAL INTELLIGENCE. · felipeliliti · opened 1 month ago · 0 comments
#992 · [BUG] Error raised while converting LLM to Megatron · KookHoiKim · opened 1 month ago · 0 comments
#991 · [BUG] CLIP key mismatch · KookHoiKim · opened 1 month ago · 1 comment
#990 · No pre-norm for non-MoE GPT-style model when using TE transformer layer spec? · hityupeng · opened 1 month ago · 2 comments
#989 · [QUESTION] Does it support Knowledge Distillation? · mushan09 · closed 1 month ago · 1 comment
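As background for #989, a standard knowledge-distillation loss (generic PyTorch, not a claim about Megatron-LM's API) softens teacher and student logits with a temperature and penalizes their KL divergence:

```python
import torch.nn.functional as F

# Standard Hinton-style distillation loss sketch; `temperature` and the
# t**2 scaling follow the common recipe, not Megatron-LM code.
def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    # t**2 keeps gradient magnitudes roughly constant across temperatures
    return F.kl_div(student_logp, teacher_prob, reduction="batchmean") * (t * t)
```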
#988 · [BUG] Arguments of get_cpu_offload_context() in transformer_engine.py for different versions of TE · 1195343015 · closed 3 weeks ago · 4 comments
#987 · Add Hopper llama golden with mcore calling stack · yiakwy-xpu-ml-framework-team · opened 1 month ago · 5 comments
#986 · [Bugfix] Fix typo in MoE doc · billishyahao · opened 1 month ago · 1 comment
#985 · [QUESTION] GLU activation with tensor parallelism in GroupedMLP · Teng-xu · closed 3 weeks ago · 6 comments
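To ground #985's question: a GLU-style MLP projects the input twice and gates one projection with the activation of the other; under tensor parallelism both projections must be partitioned identically so the elementwise product stays local to each rank. A SwiGLU-flavored single-rank sketch (illustrative, not GroupedMLP code):

```python
import torch
import torch.nn.functional as F

# SwiGLU-style gated activation on one rank. With tensor parallelism,
# w_gate and w_up would be column-partitioned the same way so the
# elementwise product needs no communication.
x = torch.randn(4, 16)        # [tokens, hidden]
w_gate = torch.randn(16, 32)  # gate projection
w_up = torch.randn(16, 32)    # up projection
out = F.silu(x @ w_gate) * (x @ w_up)  # [tokens, 32]
```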
#975 · [QUESTION] Splitting large documents and bucketing · shafiqabedin · closed 3 weeks ago · 0 comments
#973 · [bugfix] Fix _warmup_jit_function · taowangcheng · opened 1 month ago · 2 comments
#972 · [bugfix] Fix the incorrect with-statement · aaa123git · opened 1 month ago · 0 comments
#971 · [QUESTION] Megatron-LM `DistributedOptimizer` or NeMo `MegatronDistributedFusedAdam` Optimizer? · TJ-Solergibert · closed 3 weeks ago · 0 comments
#970 · [QUESTION] Checkpoint storage format · syx11237744 · closed 3 weeks ago · 0 comments
#969 · [QUESTION] · suzewei · closed 3 weeks ago · 0 comments
#968 · nothing · wangwz6666 · closed 1 month ago · 0 comments
#967 · BitPipe_initial_version · wuhouming · closed 1 month ago · 0 comments
#966 · [QUESTION] How to convert a Hugging Face checkpoint and also use PP > 1 or TP > 1 · sambar1729 · closed 3 weeks ago · 0 comments
#965 · [QUESTION] About memory usage in dot_product_attention.py · sambar1729 · closed 1 month ago · 1 comment
#964 · [QUESTION] Asynchronous Checkpoint Saving · zhaoyang-star · closed 3 weeks ago · 11 comments
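The pattern behind #964's topic, sketched generically (this is not the Megatron-LM implementation): snapshot the state to CPU synchronously so training cannot race the copy, then let a background thread do the slow disk write.

```python
import threading
import torch

# Minimal async-checkpoint sketch. The CPU copy happens on the caller's
# thread; only the torch.save runs in the background. Join the returned
# thread before overwriting the same file or at shutdown.
def save_async(model: torch.nn.Module, path: str) -> threading.Thread:
    cpu_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    writer = threading.Thread(target=torch.save, args=(cpu_state, path))
    writer.start()
    return writer
```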
#963 · Learning rate error when continuing training · TtCWH · closed 1 month ago · 2 comments
#961 · Update README.md · ArtificialZeng · opened 2 months ago · 0 comments
#960 · Fix typo in token_dispatcher.py · xinqiu · opened 2 months ago · 1 comment
#959 · transformer_engine import error · yuvraj27khanna02 · opened 2 months ago · 2 comments
#958 · (Pre-training Mamba with train.sh) Error: GPT2BPETokenizer: assert args.vocab_file is not None · SkanderBS2024 · opened 2 months ago · 6 comments
#957 · [BUG] Infinite Loop in `_get_num_epochs` Function of `GPTDataset` Class When `num_tokens_per_epoch` is Zero · Dune-Z · opened 2 months ago · 1 comment
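A plausible reconstruction of the failure mode in #957 (illustrative; the function name and signature below are hypothetical, not the actual GPTDataset code): if each epoch contributes zero tokens, a loop that accumulates tokens toward a target never makes progress, so the fix is a guard on the zero case.

```python
# Hypothetical reconstruction of the #957 hang: with num_tokens_per_epoch
# == 0 the while-loop below never terminates. A guard avoids the hang.
def get_num_epochs(num_tokens_per_epoch: int, num_tokens_required: int) -> int:
    if num_tokens_per_epoch <= 0:
        raise ValueError("num_tokens_per_epoch must be positive")
    num_epochs, num_tokens = 1, num_tokens_per_epoch
    while num_tokens < num_tokens_required:
        num_epochs += 1
        num_tokens += num_tokens_per_epoch
    return num_epochs
```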
#956 · Fix llama3 checkpoint converter · alex-ht · closed 2 months ago · 0 comments
#955 · [BUG] MoE router top-k algorithm is different from the Hugging Face implementation · Au3C2 · closed 1 month ago · 1 comment
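One common source of the kind of mismatch reported in #955, shown as a generic illustration (neither variant is claimed to be the exact Megatron-LM or Hugging Face code), is the order of softmax and top-k in the router: the selected experts match either way, but the gate weights differ.

```python
import torch

logits = torch.randn(4, 8)  # [tokens, experts]
k = 2

# Variant A: softmax over all experts, then pick top-k.
# Gate weights are un-renormalized probabilities (sum < 1 per token).
weights_a, experts_a = logits.softmax(dim=-1).topk(k, dim=-1)

# Variant B: pick top-k logits, then softmax over just those k.
# Gate weights sum to exactly 1 per token, so they differ from A.
top_logits, experts_b = logits.topk(k, dim=-1)
weights_b = top_logits.softmax(dim=-1)
```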
#954 · [QUESTION] Why is `reset_attention_mask=False` by default? · dtamayo-nlp · closed 3 weeks ago · 0 comments
#953 · [QUESTION] One possible typo in docs/source/distrib_optimizer.md · wplf · closed 3 weeks ago · 0 comments
#952 · [BUG] Error pre-training BERT · fabiancpl · opened 2 months ago · 2 comments
#951 · Different Tokenizer · dustinwloring1988 · closed 3 weeks ago · 0 comments
#950 · [BUG] Bug when using --use-mcore-models and --overlap-param-gather · Kingsleyandher · opened 2 months ago · 2 comments
#949 · [BUG] `examples/multimodal/combine_mistral_clip.sh` vision model file mismatch · Baibaifan · opened 2 months ago · 1 comment
#948 · [bugfix] Fixed combine_mistral_clip.sh · Baibaifan · closed 6 days ago · 1 comment
#946 · [QUESTION] About Optimizer & Params Offload · shh2000 · closed 2 months ago · 1 comment
#945 · ERROR: Could not find a version that satisfies the requirement triton==2.1.0 (from versions: none) "MAMBA" · SkanderBS2024 · opened 2 months ago · 4 comments
#944 · Distributed Mamba Training · SkanderBS2024 · opened 2 months ago · 7 comments
#943 · [BUG] Spelling mistake · G-keng · closed 2 months ago · 1 comment
#942 · [BUG] RuntimeError: CUDA error: device-side assert triggered · wccccp · closed 2 months ago · 1 comment
#941 · [DOC] Fix wrong llama2 pretraining URL in README · lausannel · opened 2 months ago · 1 comment