NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Not attaching to running container when using --container-name #80

Closed hendraet closed 2 years ago

hendraet commented 2 years ago

When two jobs run with the same --container-name, the second job attaches to the container that the first job is already running. Is it possible to disable this behavior and run two separate containers based on the same container name?

flx42 commented 2 years ago

Sorry, this is not possible right now. I agree it should be, though, and I will add the feature soon (I can't tell you exactly when).

Jopyth commented 2 years ago

Sounds great!

Maybe a bit of background on our use case: we appreciate the "caching" behavior of named containers, which greatly reduces container creation and startup time by reusing the filesystem extracted by previous jobs. It already works well for consecutive jobs, but it can get messy with concurrent jobs.

In our case it is often sufficient for the container's filesystem to be read-only as well (e.g. with some directories mounted for writing job results).

flx42 commented 2 years ago

@hendraet @Jopyth please test https://github.com/NVIDIA/pyxis/commit/d7ae226cf6ecf9b515a6d625baee9de36f703341 on the current master branch

flx42 commented 2 years ago

Something like --container-name pytorch:no_exec should do what you want.
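
As a rough illustration of the suggestion above, the sketch below builds the srun command line for two concurrent job steps that share the container name pytorch with the no_exec flag appended. It is a dry run under stated assumptions: it only prints the commands rather than submitting them, the helper step_cmd and the workload "python train.py" are hypothetical, and the exact effect of no_exec depends on the pyxis version deployed on the cluster.

```shell
#!/bin/sh
# Dry-run sketch: prints srun command lines instead of executing them,
# so it can be inspected on a machine without Slurm or pyxis.

NAME="pytorch"   # container name used in the thread

# step_cmd is a hypothetical helper, not part of pyxis or Slurm.
# It appends the no_exec flag from the thread to the container name,
# which should make the step start its own container from the cached
# filesystem rather than attach to one that is already running.
step_cmd() {
    printf 'srun --container-name %s:no_exec %s\n' "$NAME" "$1"
}

# Two steps that might run at the same time on the same node.
# "python train.py" is a placeholder workload.
step_cmd "python train.py"
step_cmd "python train.py"
```

To actually submit a step, run the printed line in a shell on a cluster where pyxis is built from a revision that includes the commit linked above.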

Jopyth commented 2 years ago

Thank you @flx42, I have deployed that commit to our cluster.