NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Don't get stdout/stderr output from entry point script #137

Open sphuber opened 2 months ago

sphuber commented 2 months ago

I am running a container with an entrypoint script. If I run it through Docker, the stdout output is forwarded as usual, but when running through pyxis, I see no output. The Dockerfile looks like this:

FROM ubuntu:22.04
COPY init.sh /opt/init.sh
ENTRYPOINT [ "/opt/init.sh" ]
CMD [ "/bin/bash" ]

where init.sh is

#!/bin/bash
echo "TESTING"
exec "$@"

When I run with Docker, I see TESTING in the output as expected:

$ docker build . -t test
$ docker run test:latest
TESTING

But when running through pyxis I get:

$ enroot import -o ./test.sqsh dockerd://test
$ srun --container-image=./test.sqsh --container-entrypoint hostname
srun: job 19314 queued and waiting for resources
srun: job 19314 has been allocated resources
some-hostname

Is this expected? Is there some way to have the stdout/stderr from the entrypoint script still visible?

flx42 commented 2 months ago

Yes, that's expected, even if it is sometimes unfortunate. The container launch is done from within the plugin, and its output is hidden from the job unless there is an error. This was done because some containers have a very verbose entrypoint, and that verbosity would be multiplied by the number of nodes you are running on.

My recommendation would be to manually run the entrypoint in your command, something like this:

$ srun --no-container-entrypoint --container-image=./test.sqsh /opt/init.sh hostname

sphuber commented 2 months ago

Thanks for your quick reply. Unfortunately, I don't control the srun call myself. It is issued by another program that runs sbatch, so I cannot pass the init script as a positional argument. Is there no way around this to get the output to be forwarded still? Seems like that would be a common use case, would it not?

Also, are you sure that errors are forwarded? I tested this by adding echo SOME ERROR >&2 to the init script and I don't see that message.

flx42 commented 2 months ago

> Is there no way around this to get the output to be forwarded still?

Not right now, no.

> Seems like that would be a common use case, would it not?

You are not the first one to wonder where stdout/stderr are going for the entrypoint, but I haven't heard other users mention that they can't run the entrypoint manually.

> Also, are you sure that errors are forwarded?

Sorry, I meant that if the container import/create/start fails, then the log will be printed, so that users can debug the error (e.g. if it was a typo in the image name).

Clarkszw commented 4 days ago

I can run the entrypoint manually during container development, but it would be extremely helpful to have an option to forward the output, especially for integration testing purposes.

This flexibility would allow developers to enable detailed logging during critical phases such as integration testing and debugging, while avoiding excessive log output during regular operation.
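
Until such an option exists, one possible stopgap for users who cannot change the srun call is to have the entrypoint keep its own copy of its output. The following is a minimal sketch, not something confirmed in this thread: it is a variant of the init.sh from the original report that also appends its messages to a log file. The ENTRYPOINT_LOG variable and the default log path are hypothetical names chosen for illustration, and the sketch assumes the log path is on storage that remains readable after the job (for example a directory mounted from the host). It is run locally here and has not been verified under pyxis.

```shell
# Write a variant of init.sh that tees its own messages to a log file.
cat > ./init.sh <<'EOF'
#!/bin/bash
# ENTRYPOINT_LOG is a hypothetical variable; the default path is only an example.
LOG_FILE="${ENTRYPOINT_LOG:-/tmp/entrypoint.log}"
{
  echo "TESTING"
  echo "SOME ERROR" >&2
} 2>&1 | tee -a "$LOG_FILE"   # merge stderr into stdout, print it and append it to the log
# Hand over to the requested command, as in the original script.
exec "$@"
EOF
chmod +x ./init.sh

# Local dry run, with echo standing in for the real job command.
rm -f ./entrypoint.log
ENTRYPOINT_LOG=./entrypoint.log ./init.sh echo some-hostname

# The entrypoint messages are now also in the log file.
cat ./entrypoint.log
```

With this variant, anything the entrypoint prints should end up in the log file even if the launcher hides stdout/stderr, at the cost of choosing a log location up front.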