argonne-lcf/Megatron-DeepSpeed
Ongoing research training transformer language models at scale, including: BERT & GPT-2
Other
7 stars, 8 forks
issues
Added Utils for tokenization and data file list generation
#57
zhenghh04
opened
2 days ago
0
Train skip range
#56
saforem2
closed
2 days ago
0
fixed dftracer compatibility
#55
zhenghh04
opened
1 week ago
0
add agpt inference scripts
#54
vksastry
opened
1 week ago
0
hf cp conversion and inference scripts added
#53
vksastry
closed
1 week ago
0
Fixed Data Loader issue for TP>1, PP>1
#52
zhenghh04
closed
2 weeks ago
0
Move `ALCF/mds_to_hf.py` to `mds_to_hf.py`
#51
saforem2
closed
3 weeks ago
0
Feature: Checkpoint saving improvements
#50
hatanp
opened
1 month ago
0
Update README.md
#49
saforem2
closed
1 month ago
0
Pull in DPO loss
#48
saforem2
opened
2 months ago
0
Feature: Multimodality and ViT
#47
hatanp
opened
2 months ago
0
Feature: State space models
#46
hatanp
opened
2 months ago
0
Sequence parallelism
#45
hatanp
opened
2 months ago
2
Feature: Mixture of experts with token dropping
#44
hatanp
opened
2 months ago
0
Merge `alcf-helpers-patch-1` into `main`
#43
saforem2
closed
1 month ago
0
Performance measurement and profiling of 7B and 70B models on different systems
#42
zhenghh04
opened
2 months ago
0
PyTorch profiler on all platforms
#41
hatanp
opened
2 months ago
1
Develop pre-/mid-execution test harness
#40
nscottnichols
opened
2 months ago
0
Evaluation of tokenizers and vocab sizes
#39
venkat-1
opened
2 months ago
0
Performance comparison of optimizers, such as Sophia, Lamb and AdamW, and identify appropriate hyper-parameter settings
#38
venkat-1
opened
2 months ago
0
Collecting system metrics during runs.
#37
venkat-1
opened
2 months ago
4
Merge `alcf-helpers-patch` into `main`
#36
saforem2
closed
2 months ago
0
Pull in changes from `microsoft/Megatron-DeepSpeed`
#35
saforem2
opened
2 months ago
0
Create `alcf-startup-time`
#34
saforem2
closed
2 months ago
0
Update `ALCF/README.md`
#33
saforem2
closed
2 months ago
0
Merge `alcf-aurora-kvs-fix` into `main`
#32
saforem2
closed
2 months ago
0
Merge `alcf-helpers-patch-1` into `main`
#31
saforem2
closed
2 months ago
0
Distributed data lists
#30
saforem2
closed
2 months ago
0
Create `alcf-patch-1` branch
#29
saforem2
closed
2 months ago
0
Add `LLAMA_MODE` toggle
#28
saforem2
closed
2 months ago
2
Update instructions for Aurora
#27
saforem2
closed
3 months ago
1
Update `ALCF/README.md`
#26
saforem2
closed
3 months ago
0
Pull `aurora-dfl-fix` branch into `main`
#25
saforem2
closed
3 months ago
0
Fix `ezpz_{save,get}jobenv` in `ALCF/helpers.sh`
#24
saforem2
closed
3 months ago
0
Aurora updates
#23
saforem2
closed
3 months ago
0
Pull in `sequence-parallel` changes
#22
saforem2
closed
3 months ago
0
Distributed loading v2
#21
zhenghh04
closed
3 months ago
0
Pfw trace
#20
zhenghh04
closed
3 months ago
0
convert MDS checkpoint to Hf Llama model
#19
vksastry
closed
3 months ago
2
Concat datasets that belong to the same corpus
#18
zhenghh04
closed
3 months ago
0
Merge in `tokenizer-tests` branch into `main`
#17
saforem2
closed
3 months ago
0
Distributed data loading
#16
zhenghh04
closed
2 months ago
1
Fix path in `prof.export_chrome_trace()` from `pretrain_gpt_alcf.py`
#15
saforem2
closed
3 months ago
0
Sunspot frameworks tests
#14
saforem2
closed
3 months ago
0
`flash-attn` fix + new Frameworks on Sunspot
#13
saforem2
closed
4 months ago
0
[WIP] Async checkpointing support
#12
zhenghh04
opened
4 months ago
0
Merge `polaris-cuda122` branch into main
#11
saforem2
closed
4 months ago
0
Merge `alcf-tests` into `main`
#10
saforem2
closed
4 months ago
0
Remove apex deps
#9
saforem2
closed
4 months ago
2
Update `ALCF/helpers.sh`
#8
saforem2
closed
4 months ago
0