aws-samples / emr-serverless-samples

Example code for running Spark and Hive jobs on EMR Serverless.
https://aws.amazon.com/emr/serverless/
MIT No Attribution

virtualenv is not used when calling subprocess module #63

Open hguercan opened 7 months ago

hguercan commented 7 months ago

Hello,

We are using the following Dockerfile to generate the virtualenv that we later provide to our EMR Serverless 7.1 application.

FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base

# Build tools needed to compile any native extensions while populating the venv
RUN dnf install -y gcc python3 python3-devel

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install \
    venv-pack==0.2.0 \
    pytz==2022.7.1 \
    boto3==1.33.13 \
    pandas==1.3.5 \
    python-dateutil==2.8.2

# Pack the populated venv into a relocatable archive for EMR Serverless
RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
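
For reference, the archive can be exported to the local filesystem with a BuildKit build along these lines (the output directory here is an arbitrary choice, not from the original setup):

docker build --target export --output ./output .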

Within the Spark application, we have a part that calls ['aws', 's3', 'mv'] via check_call from the subprocess module. In that case the virtualenv does not appear to be used; instead the global Python (3.9), which ships without dateutil, is picked up.
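
For illustration, the call looks roughly like this (the bucket paths are placeholders, not from the actual job):

import subprocess

# Spawns a child process; `aws` and the interpreter behind it are resolved
# from the host environment, not from the packed virtualenv running the
# PySpark code.
subprocess.check_call(["aws", "s3", "mv", "s3://bucket/src/key", "s3://bucket/dst/key"])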

Of course, one could rewrite the application to invoke the currently running interpreter directly from the code logic, but I also expected to be able to tell the EMR Serverless application to use my virtualenv "in general", not just when running my PySpark application. Is that possible, or is this behavior expected?
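
For what it's worth, a minimal sketch of the workaround alluded to above, assuming the driver code runs inside the unpacked virtualenv so that sys.executable points at its interpreter (bucket paths are placeholders):

import os
import subprocess
import sys

# sys.executable points at the interpreter inside the unpacked virtualenv,
# so its directory holds the venv's console scripts.
venv_bin = os.path.dirname(sys.executable)
child_env = {**os.environ, "PATH": venv_bin + os.pathsep + os.environ.get("PATH", "")}

# Child processes now resolve executables from the virtualenv's bin/ first.
subprocess.check_call(["aws", "s3", "mv", "s3://bucket/src/key", "s3://bucket/dst/key"], env=child_env)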