-
-
In an attempt to improve performance, the user has copied the SquashFS image (which contains their dataset) onto FSx and is trying to mount the SquashFS image into the docker container with --containe…
-
Cluster creation fails, but CloudWatch logs are empty for HyperPod.
We see a message to find error details in CloudWatch but CloudWatch does not display any logs.
-
The worker nodes on the HyperPod cluster currently have a fixed root volume size of 100 GB and may run out of disk space, for example when performing large docker/pyxis builds, which are configured…
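
A quick way to confirm whether a node is close to exhausting its root volume before a large build is to check free space first. The following is only a minimal sketch: the function names and the 20 GiB threshold are hypothetical and not taken from the issue, and the path `/` assumes the build writes to the root volume.

```python
import shutil


def free_space_gib(path="/"):
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


def has_headroom(path="/", required_gib=20.0):
    """Return True if at least `required_gib` GiB are free under `path`.

    The 20 GiB default is an illustrative assumption, not a recommended value.
    """
    return free_space_gib(path) >= required_gib


if __name__ == "__main__":
    free = free_space_gib("/")
    print(f"Free space on /: {free:.1f} GiB")
    if not has_headroom("/"):
        print("Warning: the root volume may be too full for a large image build.")
```

Running this on a worker node before the build gives an early signal instead of a failure partway through.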
-
Hi All!
```
runv version 0.8.1, commit: v0.8.1-61-g773c40b
```
```
Docker version 17.05.0-ce, build 89658be
```
Runv has been started as
```
sudo ./runv --debug --driver kvmtool --kernel /…
```
-
Should login node start `slurmd`?
https://github.com/aws-samples/awsome-distributed-training/blob/76f995674b1c2e07e25814b15262baac8abc2bcd/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base…
-
One of the cluster GPUs is failing, it seems.
-
Hi All!
My env
```
runv -v
runv version 0.8.0
docker -v
Docker version 1.11.0, build 4dc5990
uname -a
Linux runv 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_…
```
-
"FailureMessage": "Instance i-XXX failed to provision with the following error: \"Lifecycle scripts did not run successfully. Ensure the scripts exist in provided S3 path, are accessible, and run with…
-
How do I increase I/O throughput for my training and tuning jobs?