aws / sagemaker-training-toolkit

Train machine learning models within a 🐳 Docker container using 🧠 Amazon SageMaker.
Apache License 2.0
496 stars 118 forks source link

Adding sys.path to PYTHONPATH breaks virtual environments #186

Open pdveenstra opened 1 year ago

pdveenstra commented 1 year ago

Describe the bug

In src/sagemaker_training/environment.py:1219, the values in sys.path (of the environment in which sagemaker-training-toolkit is executed) are added to the environment variable $PYTHONPATH.

This breaks virtual environments, because the result is that $PYTHONPATH ends up pointing to libraries in the base environment. So when we run code in a virtual conda environment, that environments' sys.path ends up looking at the $PYTHONPATH contents, which were set to the base environment's sys.path. In effect, this overwrites a virtual environments sys.path with the base environment's sys.path due to the sys.path search order link.

The end result is that the base environment libraries are given precedence over the virtual environment libraries.

We have a solution for now, which is to simply unset PYTHONPATH before we do anything with virtual environments.

However, I'm filing a bug report because it would be preferable for virtual environments to work out-of-the-box, and because this was a particularly difficult problem to find as it can present in so many different ways.

(For us, this issue was discovered because it led to a conda virtual environment with python 3.9 crashing because it attempted to use python 3.10 libraries in the base environment. Note that the error below is not important to understanding the issue, as there are many ways this issue can present itself - I'm only adding it to help others find this issue.)

Fatal Python error: init_sys_streams: can't initialize sys standard streams
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/io.py", line 54, in <module>
ImportError: cannot import name 'text_encoding' from 'io' (unknown location)

To reproduce

See the minimal example below to reproduce. In this minimal example I install two different versions of isort, namely 5.11.4 in base conda environment and 5.12.0 in dev conda environment. The output of the example is as follows, which shows the problem in action:

***** python location and version *****
(BASE): conda run --no-capture-output --live-stream -n base which python && python --version
/opt/conda/bin/python
Python 3.10.12
(DEV): conda run --no-capture-output --live-stream -n dev which python && python --version
/opt/conda/envs/dev/bin/python
Python 3.10.12

***** isort version seen by conda *****
(BASE): conda run --no-capture-output --live-stream -n base conda list isort
# packages in environment at /opt/conda:
isort                     5.11.4             pyhd8ed1ab_1    conda-forge
(DEV): conda run --no-capture-output --live-stream -n dev conda list isort
# packages in environment at /opt/conda/envs/dev:
isort                     5.12.0             pyhd8ed1ab_1    conda-forge

***** isort version inside python *****
(BASE): conda run --no-capture-output --live-stream -n base python -c 'import isort; print(isort.__version__)'
5.11.4
(DEV): conda run --no-capture-output --live-stream -n dev python -c 'import isort; print(isort.__version__)'
5.11.4

***** isort version inside python - after unset PYTHONPATH*****
(BASE): unset PYTHONPATH && conda run --no-capture-output --live-stream -n base python -c 'import isort; print(isort.__version__)'
5.11.4
(DEV): unset PYTHONPATH && conda run --no-capture-output --live-stream -n dev python -c 'import isort; print(isort.__version__)'
5.12.0

To reproduce the above example, use the files problem.Dockerfile, launcher.py and wrong_version_demo.py as defined below (inserting appropriate values for your aws account + region where <insert> is written), and then locally do:

conda create -n problem
conda activate problem
mamba install sagemaker-python-sdk=2.166.0

docker build --file problem.Dockerfile --target base -t <insert> .
python launcher.py

problem.Dockerfile

# syntax=docker/dockerfile:1
FROM condaforge/mambaforge:23.1.0-2 as base

ARG DEBIAN_FRONTEND=noninteractive

RUN apt-get update -yqq
RUN apt-get install -yqq build-essential
RUN apt-get install -yqq python3-dev
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/* /var/lib/{apt,dpkg,cache,log}

RUN conda config --set always_yes true \
 && conda config --set unsatisfiable_hints true \
 && conda config --prepend channels defaults \
 && conda config --prepend channels conda-forge

RUN pip3 install sagemaker-training==4.6.1

RUN conda install -n base isort=5.11.4
RUN conda create -n dev python=3.10 isort=5.12.0

WORKDIR /work

launcher.py

from pathlib import Path
import sagemaker
from sagemaker.estimator import Estimator

def main():
    sagemaker_session = sagemaker.LocalSession()
    current_file_path = Path(__file__).resolve().absolute()
    local_script_path = current_file_path.parent / "wrong_version_demo.py"

    image_uri="<INSERT>"
    role="<INSERT>"
    subnets=["<INSERT>"],
    security_group_ids=["<INSERT>"],

    # Launch the training job
    estimator = Estimator(
        environment={"SAGEMAKER_PROGRAM": "wrong_version_demo.py"},
        image_uri=image_uri,
        role=role,
        subnets=subnets,
        security_group_ids=security_group_ids,
        instance_count=1,
        instance_type="local",
        max_run=1*60*60,
        sagemaker_session=sagemaker_session,
        entry_point=local_script_path
    )
    estimator.fit(inputs=None, job_name="problem-job", wait=True)

if __name__ == "__main__":
    main()

wrong_version_demo.py

import os
import subprocess
import sys

def demo():

    print("="*100)

    print(f"{'*'*5} python location and version {'*'*5}")
    cmd = "conda run --no-capture-output --live-stream -n base which python && python --version"
    print(f"(BASE): {cmd}")
    execute(cmd)

    cmd = "conda run --no-capture-output --live-stream -n dev which python && python --version"
    print(f"(DEV): {cmd}")
    execute(cmd)

    print(f"{'*'*5} isort version seen by conda {'*'*5}")
    cmd = "conda run --no-capture-output --live-stream -n base conda list isort"
    print(f"(BASE): {cmd}")
    execute(cmd)

    cmd = "conda run --no-capture-output --live-stream -n dev conda list isort"
    print(f"(DEV): {cmd}")
    execute(cmd)

    print(f"{'*'*5} isort version inside python {'*'*5}")
    cmd = "conda run --no-capture-output --live-stream -n base python -c 'import isort; print(isort.__version__)'"
    print(f"(BASE): {cmd}")
    execute(cmd)

    cmd = "conda run --no-capture-output --live-stream -n dev python -c 'import isort; print(isort.__version__)'"
    print(f"(DEV): {cmd}")
    execute(cmd)

    print(f"{'*'*5} isort version inside python - after unset PYTHONPATH{'*'*5}")
    cmd = "unset PYTHONPATH && conda run --no-capture-output --live-stream -n base python -c 'import isort; print(isort.__version__)'"
    print(f"(BASE): {cmd}")
    execute(cmd)

    cmd = "unset PYTHONPATH && conda run --no-capture-output --live-stream -n dev python -c 'import isort; print(isort.__version__)'"
    print(f"(DEV): {cmd}")
    execute(cmd)

    print("="*100)

def execute(cmd: str):

    subprocess.run(
        cmd,
        shell=True,
        check=True,
        env=os.environ.copy(),
        bufsize=1,
        text=True,
        stderr=sys.stderr,
        stdout=sys.stdout
    )

if __name__ == "__main__":
    demo()

Expected behavior

Either:

  1. Allow virtual environments to work out of the box without polluting their module search paths.
  2. If (1) isn't possible, then document the use of conda environments / virtual environments.

Screenshots or logs N/A

System information Versions are specified in reproduced example section.

Additional context N/A