NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Start container directly from SquashFS files #24

Closed: jfolz closed this issue 4 years ago

jfolz commented 4 years ago

I managed to get overlayfs with SquashFS files working for enroot, so on a worker node I can run `enroot start image.sqsh` directly. For a large image like PyTorch converted from NGC this takes roughly 35 seconds, which is a bit longer than Docker but still acceptable.

However, with `srun --container-image=image.sqsh` the task on the same worker node will first unsquash the file and only then start the container. This takes almost 7 minutes and a lot of temporary disk space. Is there a way to tell pyxis to start the image directly?
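For reference, the two invocations being compared look roughly like this (`image.sqsh` is the image from above; the trailing command is just a placeholder):

```
# Direct start with enroot on the worker node: the SquashFS file is
# overlay-mounted instead of extracted (~35 s for this large image)
enroot start image.sqsh

# Same image through Slurm/pyxis: the container root filesystem is first
# extracted from the SquashFS file, then started (~7 min here)
srun --container-image=image.sqsh <command>
```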

flx42 commented 4 years ago

> Is there a way to tell pyxis to start the image directly?

Sorry, there is no way to do that today. It will indeed create the container by extracting the squashfs file. The reason is that we rarely use this pattern on our cluster: fuse-overlayfs is usually slower than a regular filesystem, and we have hit some subtle bugs with it in the past (though that depends on your version, of course). So I'm not against supporting this use case, but I can't tell you if or when it will be added.

But 7 minutes is very slow, so perhaps take a look at this section of the wiki: https://github.com/NVIDIA/pyxis/wiki/Setup#enroot-configuration-example You might be able to speed up the process significantly by putting some of the temporary directories on a tmpfs. Tuning the compression options might also help: you can use a different compression format (like zstd or lz4), or no compression at all.
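As a rough sketch of what that could look like (the variable names come from enroot's configuration, but the paths and mksquashfs flags below are only illustrative; the wiki page has the maintained example):

```
# /etc/enroot/enroot.conf (sketch; adjust paths to your node layout)
ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)      # tmpfs-backed runtime dir
ENROOT_CACHE_PATH   /raid/enroot-cache/user-$(id -u)
ENROOT_DATA_PATH    /tmp/enroot-data/user-$(id -u)
ENROOT_TEMP_PATH    /tmp                            # fast scratch for extraction

# Trade compression ratio for import/extraction speed
ENROOT_SQUASH_OPTIONS -comp lz4 -Xhc -b 256K
```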

The squashfs extraction should also be multi-threaded. What kind of system are you using? Is it a DGX-1, DGX-2, DGX A100, or something else? Maybe your Slurm job was limited to a single core, which would also slow down the extraction.
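For example (the core count is only illustrative), making sure the job allocation has several CPUs available lets the parallel extraction actually use them:

```
# Give the job enough cores for the multi-threaded SquashFS extraction
srun --cpus-per-task=16 --container-image=image.sqsh <command>
```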

jfolz commented 4 years ago

> Sorry, there is no way to do that today. It will indeed create the container by extracting the squashfs file. The reason is that we rarely use this pattern on our cluster: fuse-overlayfs is usually slower than a regular filesystem, and we have hit some subtle bugs with it in the past (though that depends on your version, of course). So I'm not against supporting this use case, but I can't tell you if or when it will be added.

I agree, the concept is quite neat (very Docker-like), but I am also very much not a fan of subtle bugs. We'll keep an eye on this.

> But 7 minutes is very slow, so perhaps take a look at this section of the wiki: https://github.com/NVIDIA/pyxis/wiki/Setup#enroot-configuration-example You might be able to speed up the process significantly by putting some of the temporary directories on a tmpfs. Tuning the compression options might also help: you can use a different compression format (like zstd or lz4), or no compression at all.

Thank you for the hints. So far we've been running the default config for enroot and pyxis.

> The squashfs extraction should also be multi-threaded. What kind of system are you using? Is it a DGX-1, DGX-2, DGX A100, or something else? Maybe your Slurm job was limited to a single core, which would also slow down the extraction.

A custom server for now, since we're still evaluating and all the DGX systems are busy :) unsquashfs looks fully I/O-bound, so I imagine an NVMe RAID would speed this process up considerably.

3XX0 commented 4 years ago

Also note that the `--container-name` feature of pyxis wouldn't work if backed by fuse-overlayfs.

jfolz commented 4 years ago

After some config issues (since we compiled from source, enroot was looking for /usr/local/etc/enroot/enroot.conf instead of /etc/enroot/enroot.conf), startup times with tmpfs storage are down to 18 seconds from a local SATA RAID and 27 seconds from BeeGFS. We might still do some tuning, but those times are workable 👍
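In case anyone else hits the same path mismatch: a source build defaults to the /usr/local prefix, so one option is to rebuild with the system config location, or simply symlink the existing file (the Makefile variables below are an assumption; the symlink paths are the ones from this thread):

```
# Option A (assumption: enroot's Makefile honors GNU-style prefix/sysconfdir)
sudo make install prefix=/usr/local sysconfdir=/etc

# Option B: symlink the system-wide config into the location the binary expects
sudo mkdir -p /usr/local/etc/enroot
sudo ln -s /etc/enroot/enroot.conf /usr/local/etc/enroot/enroot.conf
```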