aws-samples / awsome-distributed-training

Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS.
MIT No Attribution
182 stars, 74 forks

Squashfs volumes are not mountable thru enroot on HyperPod #193

cfregly closed this issue 6 months ago

cfregly commented 6 months ago

In an attempt to improve performance, the user copied the SquashFS image (which contains their dataset) onto FSx and is trying to mount it into the container with --container-mounts.

However, enroot/pyxis could not access the files inside the mounted SquashFS image, apparently because enroot/pyxis does not support SquashFS mounts.

cfregly commented 6 months ago

This is a known issue with enroot/pyxis described here: https://github.com/NVIDIA/enroot/issues/180#issuecomment-1971965852

Some options are the following:

1/ Extract the dataset onto FSx, but this creates many small files, which may not be optimal.

2/ Extract the dataset onto each worker node's local NVMe disk for better performance.

3/ Use squashfuse and autofs per https://github.com/NVIDIA/enroot/issues/180#issuecomment-1971965852
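
As a rough illustration of option 2, here is a minimal Python sketch that extracts the image once per node with `unsquashfs` and then builds the `--container-mounts` argument for the resulting plain directory. All paths (`/fsx/datasets/dataset.sqsh`, `/opt/dlami/nvme/dataset`, `/data`) and the helper names are hypothetical, not taken from the issue; adjust them for your cluster. It assumes `unsquashfs` (from squashfs-tools) is installed on the worker nodes.

```python
import shlex
import subprocess
from pathlib import Path

# Hypothetical paths, not from the issue; adjust for your cluster.
SQUASHFS_IMAGE = Path("/fsx/datasets/dataset.sqsh")   # image stored on FSx
LOCAL_NVME_DIR = Path("/opt/dlami/nvme/dataset")      # node-local extraction target
CONTAINER_DATA_DIR = "/data"                          # path seen inside the container


def unsquashfs_cmd(image, dest):
    """Build the command that extracts a SquashFS image into dest."""
    return ["unsquashfs", "-d", str(dest), str(image)]


def container_mounts_arg(host_dir, container_dir):
    """Build the pyxis flag that bind-mounts a plain host directory."""
    return f"--container-mounts={host_dir}:{container_dir}"


def extract_if_missing(image, dest):
    """Extract once per node; skip when dest already exists.

    Returns True if an extraction ran, False if it was skipped.
    unsquashfs creates dest itself, so only its parent must exist.
    """
    if Path(dest).exists():
        return False
    subprocess.run(unsquashfs_cmd(image, dest), check=True)
    return True


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    print(shlex.join(unsquashfs_cmd(SQUASHFS_IMAGE, LOCAL_NVME_DIR)))
    print(container_mounts_arg(LOCAL_NVME_DIR, CONTAINER_DATA_DIR))
```

The extraction step would need to run on every worker node before the job starts (for example from a prolog or a one-task-per-node `srun` step), since each node has its own NVMe disk. The job itself then mounts an ordinary directory, which sidesteps the SquashFS limitation.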