aws-samples/awsome-distributed-training
Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
134 stars, 57 forks
issues
Relocate docker/pyxis to /opt/sagemaker directory
#367
sean-smith
opened
3 days ago
1
Update SMPv2 conda setup script with latest PT2.3.1 TSM2.4.0
#366
viclzhu
opened
3 days ago
0
Bump transformers from 4.31.0 to 4.38.0 in /3.test_cases/22.SMHP-trainium-llama3
#365
dependabot[bot]
opened
4 days ago
0
Add fsdp and smpv2 example for EKS
#364
iankouls-aws
closed
3 days ago
3
Trainium llama3
#363
syedazi
closed
4 days ago
1
Eks examples - FSDP example and documentation added
#362
iankouls-aws
opened
1 week ago
0
Upgrade packages in container to resolve CVEs
#361
iankouls-aws
closed
1 week ago
0
Bump scikit-learn from 1.2.1 to 1.5.0 in /3.test_cases/4.DDP
#360
dependabot[bot]
closed
1 week ago
0
Neuron distributed
#359
KeitaW
opened
2 weeks ago
1
Fix readme file in NCCL tests
#358
perifaws
closed
2 weeks ago
0
Remove borked ascii logo, once more
#357
perifaws
closed
2 weeks ago
0
Fix/readme - fix borked logo
#356
perifaws
closed
2 weeks ago
0
Change readme to refer to recent test cases & assets
#355
perifaws
closed
2 weeks ago
0
Warning for maximum sequence length when running FSDP Llama2 example
#354
amanshanbhag
opened
2 weeks ago
0
PlacementGroup option for "Capacity Blocks for ML"
#353
liyier90
opened
3 weeks ago
2
Added EFA Node Exporter for EKS
#352
awsankur
closed
3 weeks ago
0
Sh bad substitution error
#351
sean-smith
opened
3 weeks ago
0
Efa node exporter eks
#350
awsankur
closed
3 weeks ago
0
Update README.md with typo in hyperlink
#349
awsankur
closed
3 weeks ago
0
fix script name in 1.torch-screen.sbatch
#348
KeitaW
closed
4 weeks ago
0
add metrics on the headnode
#347
sean-smith
closed
1 month ago
0
Pcluster easy-ssh.sh
#346
sean-smith
closed
1 month ago
0
Bump requests from 2.31.0 to 2.32.0 in /3.test_cases/14.bionemo
#345
dependabot[bot]
closed
1 month ago
0
Bump requests from 2.31.0 to 2.32.0 in /3.test_cases/9.nemo-multimodal
#344
dependabot[bot]
closed
1 month ago
0
Nsight
#343
awsankur
closed
3 weeks ago
0
Mamba
#342
syedazi
closed
2 weeks ago
1
End-to-End LLM Model Development with Torchtitan and Torchtune
#341
KeitaW
opened
1 month ago
4
FSDP Training Job failing on Validation Step (Batch 500)
#340
nghtm
opened
1 month ago
0
Add in lt for p5
#339
sean-smith
closed
1 month ago
0
Feature/ldap server
#338
mhuguesaws
closed
1 month ago
0
Deleting --cpus-per-task in PyTorch DDP on CPU for Docker sample
#337
shimomut
closed
1 month ago
0
Update Dockerfile efa exporter public ecr
#336
mhuguesaws
closed
1 month ago
0
Change from Dockerhub to Public ECR
#335
sean-smith
closed
1 month ago
1
EKS CB fix for un-managed Node group
#334
sean-smith
closed
1 month ago
0
Update README.md for Hyperpod
#333
nghtm
closed
1 month ago
0
Update to Ubuntu 20.04
#332
sean-smith
closed
1 month ago
0
Llama training with FP8
#331
pbelevich
opened
1 month ago
2
Check if wget command installed in easy-setup.sh
#330
KeitaW
closed
1 month ago
0
NCCL Test Script for AMI
#329
sean-smith
closed
1 month ago
0
NCCL Tests for AWS ParallelCluster
#328
sean-smith
closed
1 month ago
1
FSDP with meta device requires sync_module_states=True
#327
pbelevich
closed
1 month ago
2
updates to HyperPod Observability Lifecycle scripts
#326
nghtm
closed
1 month ago
0
update setup conda env script
#325
jasonlee-sf
closed
1 month ago
0
HyperPod Lifecycle Script install_dcgm_exporter.sh is failing on g5.48xlarges
#324
nghtm
closed
1 month ago
1
pytorch-screen.py: fix attribute error (backend einsum) on newer pytorch
#323
verdimrc
closed
1 month ago
0
Example for benchmarking ML workloads using Torch Profiler and NSight
#322
syedazi
opened
1 month ago
0
Updated HyperPod architecture README to explain how to update config.py
#321
shimomut
closed
1 month ago
0
Fix `easy-setup.sh` + make `region` and `profile` configurable in `validate-config.py`
#320
KeitaW
closed
1 month ago
2
Bump pcluster version to 3.9.1
#319
sean-smith
opened
1 month ago
0
add missing argument to help in easy-setup.ssh
#318
KeitaW
closed
1 month ago
0