aws-samples/awsome-distributed-training
Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
203 stars
86 forks
issues
Newest
Update NCCL tests for both slurm and k8s
#506
KeitaW
opened
2 days ago
1
fix: Updating Dockerfile to pin versions and fix the example
#505
mnuyens
closed
3 days ago
0
pin prometheus to version 2.55.1
#504
nghtm
closed
4 days ago
1
Include full Dockerfile for FSDP example
#503
sean-smith
closed
3 days ago
2
name SSM doc
#502
sean-smith
opened
5 days ago
1
Change aws ofi plugin version 1.13.0
#501
mhuguesaws
opened
6 days ago
2
check for ssm document
#500
sean-smith
closed
6 days ago
1
Raising the default Kubernetes version in HyperPod EKS template to 1.30
#499
shimomut
closed
1 week ago
0
add link to the HP EKS workshop
#498
KeitaW
closed
1 week ago
1
create ~/.ssh/config if it doesn't exist
#497
sean-smith
closed
1 week ago
1
Issues running easy-ssh.sh script locally
#496
nghtm
closed
6 days ago
1
Update 0.nvcr-pytorch-aws.dockerfile
#495
nghtm
closed
1 week ago
0
small improvement to login
#494
sean-smith
closed
1 week ago
0
Update easy-ssh.sh
#493
sean-smith
closed
1 week ago
0
Change slurm exporter to Slinky slurm exporter
#492
mhuguesaws
opened
1 week ago
2
FSDP EKS Example failing with: module 'torch.library' has no attribute 'register_fake'
#491
nghtm
opened
2 weeks ago
1
easy-ssh.sh switch to default (ubuntu) user
#490
sean-smith
closed
1 week ago
0
Synchronizing the CF template with the one we have used in the workshop
#489
shimomut
closed
2 weeks ago
0
Improvements/nccl efa update
#488
mhuguesaws
closed
2 weeks ago
0
Automate onboarding smhp slurm
#487
amanshanbhag
closed
2 weeks ago
3
Generate keypair by default
#486
sean-smith
closed
1 week ago
2
add keypair to ~/.ssh/config file
#485
sean-smith
closed
2 weeks ago
0
Added onboarding automation script
#484
amanshanbhag
closed
2 weeks ago
0
Create fsdp-simple.yaml
#483
KeitaW
closed
2 weeks ago
0
Unifying CF template for FSxL for HyperPod Slurm and EKS
#482
shimomut
closed
2 weeks ago
3
Add llama3.1 training support in FSDP test case
#481
KeitaW
opened
2 weeks ago
0
FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. in FSDP test case
#480
KeitaW
opened
2 weeks ago
0
Deprecated dataset "c4"
#479
KeitaW
opened
2 weeks ago
0
Suboptimal performance in FSDP test case.
#478
KeitaW
opened
2 weeks ago
0
Add github stars to readme file
#477
perifaws
closed
2 weeks ago
0
Remove cuda compat
#476
jahaniam
closed
3 weeks ago
1
EFA Does not Work on New NVIDIA Driver and CUDA Versions `system has unsupported display driver`
#475
kamal-rahimi
closed
3 weeks ago
11
Update README.md on NCCL tests example to include MPIOperator setup
#474
KeitaW
closed
3 weeks ago
0
don't display login in motd if no login node
#473
sean-smith
closed
2 weeks ago
0
nemofw-training is probably deprecated
#472
KeitaW
opened
3 weeks ago
0
HyperPod Mountpoint for s3 Lifecycle Script
#471
nghtm
closed
3 weeks ago
2
Head node and login node ip in motd
#470
sean-smith
closed
4 weeks ago
3
Update requirements.txt
#469
nghtm
closed
3 weeks ago
0
Update motd.sh
#468
nghtm
closed
4 weeks ago
1
FSDP sample fails with CUDA initialization error on HyperPod EKS
#467
shimomut
opened
4 weeks ago
6
Change aws-ofi-plugin for EFA 1.35.0 due to regression
#466
mhuguesaws
closed
1 month ago
0
Spelling nits in README.md
#465
jimburtoft
closed
1 month ago
0
Update pcluster architecture guidance
#464
KeitaW
opened
1 month ago
2
Providing updated pcluster guidance for GENIAC participants.
#463
KeitaW
closed
1 month ago
0
add GPU accounting for SMHP
#462
KeitaW
opened
1 month ago
1
Bump deepspeed from 0.9.2 to 0.15.1 in /3.test_cases/13.SM-dataparallel-deepspeed/code
#461
dependabot[bot]
closed
1 month ago
0
FSDP sample fails model validation
#460
shimomut
opened
1 month ago
0
Adding the CF template for FSx Lustre for HyperPod EKS workshop
#459
shimomut
closed
1 month ago
5
Install numpy v1 specifically for the FSDP sample app
#458
shimomut
closed
1 month ago
0
17.SM-modelparallelv2 uses pytorch binary that depends on deprecated conda packages
#457
junpuf
opened
1 month ago
1