AcademySoftwareFoundation / rez

An integrated package configuration, build and deployment system for software
https://rez.readthedocs.io
Apache License 2.0

"process ID out of range" error when using rez-env from within a Docker container with namespaced PIDs #1732

Closed darkvertex closed 2 months ago

darkvertex commented 2 months ago

At my studio we have a Rocky image that resembles our workstation configuration and certain pipeline services use rez-env within containers to run some things.

While transitioning to Rez in my containerized environment, I noticed that it seems unhappy about namespaced PIDs, which are the default with docker run.

Environment

OS: Rocky Linux 8.9 (rockylinux:8.9 Docker image)
Rez version: 2.112.0
Python version: 3.9.18

To Reproduce

I prepped a minimalist Dockerfile to reproduce the issue. Save this Dockerfile below:

FROM rockylinux:8.9

# Install python, bash, wget, locale stuff, lsb_release (for Rez) and do any security updates:
RUN yum install -y python3.9 bash glibc-langpack-en wget redhat-lsb-core && yum update -y && yum clean all
# Symlink "python3" as "python" so that "rez-bind --quickstart" does not complain:
RUN ln -s /usr/bin/python3 /usr/bin/python
# Set locale to English:
ENV LANG=en_US.UTF-8
ENV LC_ALL=en_US.UTF-8
# Python things:
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONFAULTHANDLER=1
ENV PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring

# Install Rez:
ARG REZ_VERSION=2.112.0
ENV REZ_VERSION=${REZ_VERSION}
ENV REZ_INSTALL_ROOT=/usr/local/rez
RUN wget -O /tmp/rez.tar.gz https://github.com/AcademySoftwareFoundation/rez/archive/refs/tags/${REZ_VERSION}.tar.gz && \
    mkdir --parents --mode=777 /tmp/rez && \
    tar -xzvf /tmp/rez.tar.gz -C /tmp/rez/ && \
    python3 /tmp/rez/rez-${REZ_VERSION}/install.py $REZ_INSTALL_ROOT/ && \
    rm -rf /tmp/rez /tmp/rez.tar.gz && \
    echo -e "export PATH=\$PATH:${REZ_INSTALL_ROOT}/bin/rez\nsource ${REZ_INSTALL_ROOT}/completion/complete.sh" >> /etc/profile.d/fps.sh
RUN $REZ_INSTALL_ROOT/bin/rez/rez-bind --quickstart
ENV PATH="$PATH:$REZ_INSTALL_ROOT/bin/rez"

Then run docker build . --tag=rocky_rez and brew some tea or coffee, because it will take a minute or two.

To see the issue, run any rez-env command. Since this is a vanilla environment, use the python package made by the quickstarter:

$ docker run --rm -it rocky_rez rez-env python -- python --version
error: process ID out of range

Usage:
 ps [options]

 Try 'ps --help <simple|list|output|threads|misc|all>'
  or 'ps --help <s|l|o|t|m|a>'
 for additional help text.

For more details see ps(1).
Python 3.9.18

As you can see, the command still runs, so the error is harmless, just scary-looking.

One workaround is to pass --pid=host, which disables PID namespace isolation so that the container shares the host's PIDs. This isn't great, though, because it weakens the container's security: it is no longer process-isolated.

$ docker run --rm -it --pid=host rocky_rez rez-env python -- python --version
Python 3.9.18

...but it runs without an error.

I suspect it is due to this command in the codebase: https://github.com/AcademySoftwareFoundation/rez/blob/f17e88dad283826a5914e3aade4fbc9a010cdba2/src/rez/system.py#L84 and I guess it's because, by default, the container's command runs as PID 1, without a parent.

Could it be detected in the shell code so it does not print a scary error?

Expected behavior

I would expect this error-free output:

$ docker run --rm -it rocky_rez rez-env python -- python --version
Python 3.9.18

Actual behavior

$ docker run --rm -it rocky_rez rez-env python -- python --version
error: process ID out of range

Usage:
 ps [options]

 Try 'ps --help <simple|list|output|threads|misc|all>'
  or 'ps --help <s|l|o|t|m|a>'
 for additional help text.

For more details see ps(1).
Python 3.9.18

JeanChristopheMorinPerso commented 2 months ago

Hi @darkvertex, thanks for the report and the detailed reproduction steps. I think we could probably check whether the ppid is zero before running the ps command.
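
A rough sketch of what such a guard could look like, for illustration only. The helper name parent_command and its injectable run parameter are hypothetical, and this is not the actual code in rez's system.py. It assumes the parent's command line is looked up with ps -o args= -p <ppid>, and that a process running as PID 1 inside a PID namespace sees a parent PID of 0, which ps rejects as out of range.

```python
import os
import subprocess


def parent_command(ppid=None, run=subprocess.run):
    """Return the parent process's command line, or None if it is unavailable.

    Hypothetical helper for illustration; not taken from the rez codebase.
    """
    if ppid is None:
        ppid = os.getppid()

    # Assumption: with namespaced PIDs, the container's first process has no
    # visible parent, so the reported ppid is 0. Skip ps in that case rather
    # than letting it print a usage error.
    if ppid <= 0:
        return None

    result = run(
        ["ps", "-o", "args=", "-p", str(ppid)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        # ps failed for some other reason; treat the parent as unknown.
        return None
    return result.stdout.strip() or None
```

Capturing stderr also keeps any remaining ps complaints out of the user's terminal; a caller would fall back to a default shell when the function returns None.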

JeanChristopheMorinPerso commented 2 months ago

I created https://github.com/AcademySoftwareFoundation/rez/pull/1735 which should fix it. Whenever you have time, can you try it out and see if it fixes the problem on your side please?

darkvertex commented 2 months ago

@JeanChristopheMorinPerso Your PR #1735 appears to resolve the bug. 😄