hyperpod Search Results

38 results for hyperpod

aws-samples/awsome-distributed-training #232

Enable autoresume for all Slurm examples

We should add the following snippet to all Slurm examples so that if it's a hyperpod cluster it'll automatically add the `--auto-resume=1` flag. This needs to be tested for all examples, see https://g…

sean-smith updated 5 days ago
1
aws-samples/awsome-distributed-training #134

Remove git dependency

https://github.com/aws-samples/awsome-distributed-training/blob/d66304ff17229dd857397d725ed9e168bc41167f/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_enroot_pyxis.sh…

sean-smith updated 2 weeks ago
4
hyperhq/runv #366

Bug report: leaking container while doing pressure test

- Test case: 1. Test runv-containerd with docker daemon 2. start 1000 container, then "docker rm" all - Expected result: All containers are removed and no container is left - Actual result: Some c…

WeiZhang555 updated 6 years ago
3
aws-samples/awsome-distributed-training #335

Change from Dockerhub to Public ECR

The following line can cause rate limits from Dockerhub: https://github.com/aws-samples/awsome-distributed-training/blob/1d15afd847f7810125c60353a09a1757188ba7aa/1.architectures/5.sagemaker-hyperpo…

sean-smith updated 1 month ago
1
aws-samples/awsome-distributed-training #324

HyperPod Lifecycle Script install_dcgm_exporter.sh is failin…

When the [install_dcgm_exporter.sh](https://github.com/aws-samples/awsome-distributed-training/blob/main/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/utils/install_dcgm_exporter.s…

nghtm updated 1 month ago
1
aws-samples/awsome-distributed-training #180

Add time sync checks across all nodes to verify nodes aren't…

In rare cases, PyTorch will time out due to system clock drift across nodes. A pre-check may help diagnose this issue before a training run starts. Add a test to hyperpod-precheck.py

DarkSector updated 4 weeks ago
3
cloud-custodian/cloud-custodian #9355

Add support for SageMaker Hyperpod Cluster resource

### Describe the feature Would be useful to support SageMaker Hyperpod Clusters as a `aws.sagemaker-cluster` resource type ### Extra information or context _No response_

mattheidelbaugh updated 3 months ago
1
aws-samples/awsome-distributed-training #204

HyperPod cluster fails to create

HyperPod cluster fails to create - ROLLBACK

cfregly updated 3 months ago
1
aws-samples/awsome-distributed-training #194

Cannot download Docker image from ECR from within HyperPod

Cannot download Docker image from ECR from within HyperPod

cfregly updated 3 months ago
1
aws-samples/awsome-distributed-training #186

Lifecycle scripts sometimes fail when creating a new HyperPo…

Lifecycle scripts sometimes fail when creating a new HyperPod cluster

cfregly updated 3 months ago
1
