-
We should add the following snippet to all Slurm examples so that the `--auto-resume=1` flag is added automatically when the job runs on a HyperPod cluster. This needs to be tested for all examples, see https://g…
-
https://github.com/aws-samples/awsome-distributed-training/blob/d66304ff17229dd857397d725ed9e168bc41167f/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_enroot_pyxis.sh…
-
- Test case:
1. Test runv-containerd with the Docker daemon
2. Start 1000 containers, then `docker rm` all of them
- Expected result:
All containers are removed and no container is left
- Actual result:
Some c…
-
The following line can trigger Docker Hub rate limiting:
https://github.com/aws-samples/awsome-distributed-training/blob/1d15afd847f7810125c60353a09a1757188ba7aa/1.architectures/5.sagemaker-hyperpo…
-
When the [install_dcgm_exporter.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_dcgm_exporter.s…
-
In rare cases, PyTorch can time out because the system clocks have drifted apart across nodes. A pre-check may help diagnose this before a training run starts.
Add a test to hyperpod-precheck.py
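
A minimal sketch of what such a check could look like, assuming per-node wall-clock timestamps (epoch seconds) have already been collected at roughly the same moment. The function name `find_clock_drift`, the `max_drift_s` threshold, and the node names are hypothetical and are not taken from hyperpod-precheck.py. How the timestamps are gathered is left out, and collection latency would add noise to the offsets.

```python
from statistics import median


def find_clock_drift(node_times, max_drift_s=1.0):
    """Return {node: offset_seconds} for nodes whose clock differs from the
    median of all reported clocks by more than max_drift_s.

    node_times: mapping of node name -> epoch timestamp in seconds,
    assumed to have been sampled at (approximately) the same instant.
    """
    if not node_times:
        return {}
    # Use the median as the reference so one bad clock does not skew it.
    reference = median(node_times.values())
    return {
        node: t - reference
        for node, t in node_times.items()
        if abs(t - reference) > max_drift_s
    }


if __name__ == "__main__":
    # Hypothetical sample data: node-3 is several seconds ahead.
    sample = {
        "node-1": 1700000000.00,
        "node-2": 1700000000.05,
        "node-3": 1700000004.20,
    }
    for node, offset in find_clock_drift(sample).items():
        print(f"{node}: clock offset {offset:+.2f}s exceeds threshold")
```

A suitable threshold depends on the timeouts configured for the training job, so the 1-second default here is only a placeholder.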
-
### Describe the feature
It would be useful to support SageMaker HyperPod clusters as an `aws.sagemaker-cluster` resource type.
### Extra information or context
_No response_
-
HyperPod cluster fails to create - ROLLBACK
-
Cannot download Docker image from ECR from within HyperPod
-
Lifecycle scripts sometimes fail when creating a new HyperPod cluster