databricks / containers

Sample base images for Databricks Container Services

Wrong pyspark version during notebook runtime #187

Open moschnetzsch opened 7 months ago

moschnetzsch commented 7 months ago

Hi,

I created a custom Docker image using this Dockerfile:

FROM databricksruntime/minimal:14.3-LTS

ARG python_version="3.9"
ARG pip_version="22.3.1"
ARG setuptools_version="65.6.3"
ARG wheel_version="0.38.4"
ARG virtualenv_version="20.16.7"

# Installs python 3.9 and virtualenv for Spark and Notebooks
RUN apt update && apt upgrade -y
RUN apt install -y curl software-properties-common apt-utils
RUN add-apt-repository ppa:deadsnakes/ppa -y
RUN apt update
RUN ln -snf /usr/share/zoneinfo/$CONTAINER_TIMEZONE /etc/localtime && echo $CONTAINER_TIMEZONE > /etc/timezone
RUN apt install -y python${python_version} python${python_version}-dev python${python_version}-distutils
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN /usr/bin/python${python_version} get-pip.py pip==${pip_version} setuptools==${setuptools_version} wheel==${wheel_version}
RUN rm get-pip.py

RUN /usr/local/bin/pip${python_version} install --no-cache-dir virtualenv==${virtualenv_version} \
  && sed -i -r 's/^(PERIODIC_UPDATE_ON_BY_DEFAULT) = True$/\1 = False/' /usr/local/lib/python${python_version}/dist-packages/virtualenv/seed/embed/base_embed.py \
  && /usr/local/bin/pip${python_version} download pip==${pip_version} --dest \
  /usr/local/lib/python${python_version}/dist-packages/virtualenv_support/

# Initialize the default environment that Spark and notebooks will use
RUN virtualenv --python=python${python_version} --system-site-packages /databricks/python3 --no-download  --no-setuptools

# These python libraries are used by Databricks notebooks and the Python REPL
# You do not need to install pyspark - it is injected when the cluster is launched
# Versions are intended to reflect latest DBR: https://docs.databricks.com/release-notes/runtime/13.3.html#system-environment

COPY requirements.txt /databricks/.
COPY databricks_requirements.txt /databricks/.

# strip pywin32 as it is not needed on linux
RUN sed -i '/pywin32/d' /databricks/requirements.txt

RUN /databricks/python3/bin/pip install -r /databricks/requirements.txt
RUN /databricks/python3/bin/pip install -r /databricks/databricks_requirements.txt

# Install Databricks CLI
RUN apt install unzip -y
RUN curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh

# Specifies where Spark will look for the python process
ENV PYSPARK_PYTHON=/databricks/python3/bin/python3
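For reference, a quick sanity check from a notebook cell to confirm which interpreter the REPL is actually running (a sketch; the printed values depend on the runtime):

import os, sys

# Interpreter Spark is told to use (set via the ENV line above)
print(os.environ.get("PYSPARK_PYTHON"))   # expected: /databricks/python3/bin/python3

# Interpreter the notebook/REPL process is actually executing
print(sys.executable)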

As in the example, the PYSPARK_PYTHON variable points to my custom Python 3.9 environment. However, when I check the imported pyspark version, it differs from the one that is installed as a sub-dependency in my Python environment, as seen in the image.

[screenshot: the pyspark version reported by the notebook differs from the one installed in the environment]
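This is roughly the check behind the screenshot: compare the pyspark that import pyspark actually resolves to against the distribution pip installed into the environment (a sketch; exact paths and versions are illustrative):

import pyspark
from importlib.metadata import version

print(pyspark.__version__)   # version of the module that actually gets imported at runtime
print(pyspark.__file__)      # where it was loaded from (likely the copy injected at cluster launch, not the virtualenv)
print(version("pyspark"))    # version pip-installed into /databricks/python3 as a sub-dependency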

This leads to a lot of complications, e.g. when using the dataengineering client with pyspark. How can I make sure the imported pyspark version is the one installed in my Python environment?