NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0
273 stars 31 forks source link

Override container environment variable from the srun command-line #26

Closed cponder closed 1 year ago

cponder commented 3 years ago

Using the form

  srun --mpi=pmix  --export RETRY_COUNT=2000 --container-image=...  .....

the command-line RETRY_COUNT is discarded if the variable is already set inside the Docker image:

  ENV RETRY_COUNT  1000

I think this is counter-intuitive, since command-line settings are applied to the specific invocation whereas the Docker setting is fixed across all invocations. If you want to maintain the current behavior for backward-compatibility purposes, you could always add a different flag like

    --container-env-override RETRY_COUNT=2000
flx42 commented 3 years ago

Yes, it was discussed in the past, as you are aware. I couldn't find a satisfying solution at the time, unfortunately.

You can't distinguish between those two approaches AFAIK:

$ srun --export FOO=bar ...
# is equivalent to
$ FOO=bar srun ...

The plugin can't blindly have the host environment take precedence over the container environment, that would be a bad idea for common environment variables like PATH or LD_LIBRARY_PATH, you do want the value in the container for those.

That means you would need to have --container-env like you mentioned. And it would have to behave like --export which brings its own set of problems, for example you can't have a comma in a value since it's used as the separator:

$ srun --export 'ALL,FOO=bar,baz' -A admin -p admin bash -c 'echo $FOO'
bar

How Slurm handles command-line arguments is fundamentally different from Docker, you can't do that:

$ srun --export 'FOO=bar' --export 'FOO2=baz'

Only the last argument will be effective, that's also annoying.

So, we would have to support exactly the same format than --export:

$ FOO=bar,baz srun --export 'ALL,FOO' -A admin -p admin bash -c 'echo $FOO'
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
bar,baz

So yes, that's doable, and we might implement it, but that's a lot of extra complexity for something that could simply be:

$ srun bash -c 'RETRY_COUNT=2000 ./myapp'
ltalirz commented 1 year ago

I just wanted to support this feature request. Since many containers are designed to be configured via environment variables (which can be a handful), it would really be great to have a more structured alternative to pasting the variables and their values into bash -c '...'

That means you would need to have --container-env like you mentioned. And it would have to behave like --export

Would it? pyxis being an enroot plugin, at least from my perspective it would be perfectly adequate if there was a separate --container-env option would work just like enroot --env

flx42 commented 1 year ago

Would it? pyxis being an enroot plugin, at least from my perspective it would be perfectly adequate if there was a separate --container-env option would work just like enroot --env

What I mean is that you won't be able to do --container-env FOO=123 --container-env BAR=456 like you would do with docker run -e or enroot --env, i.e. only the latest --container-env would be effective. This is how Slurm behaves for all its arguments (not just arguments from SPANK plugins).

ltalirz commented 1 year ago

Thanks for the explanation, I didn't get that this applies to all arguments.

cponder commented 1 year ago

I prefer this convention anyway

export FOO=bar            # Some description here.
export FOO2=baz           # And describe this too.
srun --container-env 'FOO,FOO2' ...

since (a) using explicit assignments on the command-line tends to look like junk, (b) it's easier to pass strings that embed quotes and wildcards this way, and (c) there's a place to put some explanation there.

flx42 commented 1 year ago

Finally implemented in https://github.com/NVIDIA/pyxis/commit/5164d82a6b8099c294fb512437afd85542578df6, please test when you can (but it's not part of a pyxis release yet), and thanks for pushing on this.

cponder commented 9 months ago

Have you added it to a release yet?

cponder commented 9 months ago

Another observation -- the main target of this would be to extend the PATH and LD_LIBRARY_PATH inside the container. Right now I'm using a wrapper-script to do this

srun ... env.adjust -prepend LD_LIBRARY_PATH %{ENV_NVHPC_ROOT}/compilers/lib:  --prepend PATH %{NVHPC_ROOT}/compilers/bin:%{HPCX}/ompi/bin: ...

which extends the paths without losing the components that were assigned inside the container.

I don't think my --container-env proposal would be as effective since it's only meant to over-write the variables.

flx42 commented 9 months ago

Have you added it to a release yet?

Yes: https://github.com/NVIDIA/pyxis/releases/tag/v0.15.0

Another observation -- the main target of this would be to extend the PATH and LD_LIBRARY_PATH inside the container. I don't think my --container-env proposal would be as effective since it's only meant to over-write the variables.

I would have to think about it, but the scope seems fairly narrow for this use case, and I also want to avoid feature creep.