Hello,

We are using this Dockerfile to generate the virtualenv that we later provide to our EMR Serverless 7.1 application:
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base
RUN dnf install -y gcc python3 python3-devel
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
        venv-pack==0.2.0 \
        pytz==2022.7.1 \
        boto3==1.33.13 \
        pandas==1.3.5 \
        python-dateutil==2.8.2
RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
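For context, the packed environment is attached to the job roughly along the following lines (the documented spark.archives / PYSPARK_PYTHON approach; the bucket, application ID, role ARN, and entry point below are placeholders, not our real values):

import boto3

emr = boto3.client("emr-serverless")

# All identifiers below are placeholders purely for illustration.
emr.start_job_run(
    applicationId="00abc123example",
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://example-bucket/scripts/main.py",
            "sparkSubmitParameters": (
                "--conf spark.archives=s3://example-bucket/venvs/pyspark_ge.tar.gz#environment "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
                "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            ),
        }
    },
)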
Within the Spark application we have a part which calls ['aws', 's3', 'mv'] via check_call from the subprocess module. In that case it seems the virtualenv is not used; instead the global Python (3.9), which ships without dateutil, is picked up.
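To illustrate what we observe: the child process inherits the parent's PATH, so a bare 'aws' or 'python3' resolves to whatever the image provides, not to the unpacked venv. A small diagnostic sketch along these lines shows it (the actual output of course depends on the EMR Serverless image):

import shutil
import subprocess
import sys

# The interpreter running the PySpark driver code: this should point into the unpacked venv.
print("driver interpreter:", sys.executable)

# What a subprocess will actually find on the inherited PATH: the image's own tools,
# e.g. the system python 3.9, not the venv.
print("'aws' resolves to:", shutil.which("aws"))
print("'python3' resolves to:", shutil.which("python3"))

# Same effect with check_call: the bare command name is looked up on the inherited PATH.
subprocess.check_call(["python3", "--version"])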
Of course one could rewrite the application so the move is done from the code logic with the currently running interpreter, but I also expected there would be an option to tell the EMR Serverless application "in general" to use my virtualenv, and not just when running my PySpark application. Is that possible, or is this behavior expected?
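For completeness, the "do it from the code logic" workaround would look roughly like this, performing the move with the boto3 that is already pinned in the venv instead of shelling out to the aws CLI (bucket and key names are made up):

import boto3

s3 = boto3.client("s3")

def s3_mv(src_bucket, src_key, dst_bucket, dst_key):
    # Emulate 'aws s3 mv' for a single object: server-side copy, then delete the source.
    s3.copy({"Bucket": src_bucket, "Key": src_key}, dst_bucket, dst_key)
    s3.delete_object(Bucket=src_bucket, Key=src_key)

# Hypothetical bucket/keys purely for illustration.
s3_mv("example-bucket", "staging/part-0000.parquet", "example-bucket", "final/part-0000.parquet")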