aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
136 stars 58 forks

Cluster node runs out of EBS disk space #187

Closed: cfregly closed 4 months ago

cfregly commented 4 months ago

The worker nodes on the HyperPod cluster currently have a fixed root volume size of 100GB. They may run out of disk space during large docker/pyxis builds, for example, because these are configured to use the root volume by default.

cfregly commented 4 months ago

Support for a larger root volume size is in the works. In the meantime, you can use the FSx for Lustre shared file system or the local NVMe volume as a workaround.

You would need to change the docker working directory to point at the /fsx mount or the local NVMe disk. Alternatively, you can mount your home directory on /fsx or NVMe.
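As a rough illustration of the first workaround, here is a minimal Python sketch that sets Docker's `data-root` key in `daemon.json` while keeping any other settings already in the file. It assumes the daemon reads `/etc/docker/daemon.json` (the usual default on Linux). The function names and the target directory `/fsx/docker/node-a` are hypothetical: use whatever path fits your cluster, ideally one subdirectory per node so that nodes do not share a single Docker directory. Treat this as a sketch, not the HyperPod-recommended procedure.

```python
import json
from pathlib import Path


def with_data_root(daemon_config_text: str, data_root: str) -> str:
    """Return daemon.json text with "data-root" set, preserving other keys."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(daemon_config_text) if daemon_config_text.strip() else {}
    config["data-root"] = data_root
    return json.dumps(config, indent=2) + "\n"


def relocate_docker_data_root(
    data_root: str,
    daemon_json: Path = Path("/etc/docker/daemon.json"),  # assumed default location
) -> None:
    """Create the new data directory and point daemon.json at it."""
    existing = daemon_json.read_text() if daemon_json.exists() else ""
    Path(data_root).mkdir(parents=True, exist_ok=True)
    daemon_json.parent.mkdir(parents=True, exist_ok=True)
    daemon_json.write_text(with_data_root(existing, data_root))


# Example (hypothetical path; needs root privileges on the node):
#   relocate_docker_data_root("/fsx/docker/node-a")
```

After the file is updated, the Docker daemon has to be restarted (for example with `sudo systemctl restart docker`) before it uses the new location. Images and containers stored under the old directory are not moved automatically.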