Hi Eric, thanks for opening the issue - let me double-check this example to verify it still works.
@waltari2001 I just ran through the example step-by-step and it works for me. Some things for you to check:
- Did pyspark_ge.tar.gz get bundled properly? e.g. if you run tar tzvf pyspark_ge.tar.gz | grep great_expectations, are there entries in the tar file?
- In your stderr file, you should see lines like this:
24/01/08 17:43:52 INFO SparkContext: Added archive file:/tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz#environment at spark://[2600:1f14:2e15:a301:cd15:538d:7b16:decc]:46003/files/pyspark_ge.tar.gz with timestamp 1704735831138
24/01/08 17:43:52 INFO Utils: Copying /tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz to /tmp/spark-84fd23a2-89bc-4573-a253-1ee2ecc61486/pyspark_ge.tar.gz
24/01/08 17:43:52 INFO SparkContext: Unpacking an archive file:/tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz#environment from /tmp/spark-84fd23a2-89bc-4573-a253-1ee2ecc61486/pyspark_ge.tar.gz to /tmp/spark-69cc27ba-e1dc-4691-a6ce-bc504186703d/userFiles-85021b77-6d66-4148-858b-f34d9ed35489/environment
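If it helps, one way to pull the driver stderr from the S3 log location and look for those lines (a rough sketch; the bucket, application ID, and job run ID are placeholders, and the path assumes the standard s3MonitoringConfiguration layout):
aws s3 cp "s3://<log-bucket>/logs/applications/<application-id>/jobs/<job-run-id>/SPARK_DRIVER/stderr.gz" - \
  | gunzip -c \
  | grep -E 'Added archive|Unpacking an archive'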
stderr:
Files s3://splunk-config-test/artifacts/pyspark/pyspark_ge.tar.gz#environment from /tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/pyspark_ge.tar.gz to /home/hadoop/environment 24/01/08 15:30:44 INFO ShutdownHookManager: Shutdown hook called 24/01/08 15:30:44 INFO ShutdownHookManager: Deleting directory /tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0
stdout:
Traceback (most recent call last):
File "/tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/ge_profile.py", line 4, in
I am also using emr-7.0.0 with this test.
Ah interesting...just tested 7.0.0 now and found that it doesn't work. I was on 6.14.0.
Investigating more. It could be because EMR 7 is using Amazon Linux 2023...may need to swap out the Docker base image.
OK, the image definitely needs to be updated to al2023. That said, while I can submit the job and it starts running, it just hangs and the executors die. I'm unsure if this is a Great Expectations compatibility issue with Spark 3.5(?) or something else... I tried updating to the latest version of GE, but I'm still experiencing the issue.
edit: I think it's user error - I didn't configure the EMR Serverless application with networking, and it's not in the same region as the source data.
Let me know if that's relevant or if you were just trying to get dependencies working in general.
This is the Dockerfile I used:
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base
RUN dnf install -y gcc python3 python3-devel
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install \
great_expectations==0.18.7 \
venv-pack==0.2.0
RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
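For reference, a quick sketch of how to build this and sanity-check the resulting archive (same commands used elsewhere in this thread; --output writes the exported file to the current directory):
# Build the image and export the packed venv to the current directory
DOCKER_BUILDKIT=1 docker build --output . .
# Confirm great_expectations is present in the archive
tar tzf pyspark_ge.tar.gz | grep great_expectations | head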
Yup, needed to configure the application properly.
This works now with AL2023 as the base image on EMR 7.x.
I'll leave this open until I update the examples.
Hi @dacort, I'm trying to use this AL2023 Dockerfile as a blueprint for creating an environment that has certain packages installed (having run into the same ModuleNotFoundErrors using the original Dockerfile in the repo). Specifically:
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base
RUN dnf install -y gcc python3 python3-devel
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install \
botocore boto3 requests warcio idna \
venv-pack==0.2.0
RUN mkdir /output && venv-pack -o /output/pyspark_venv.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_venv.tar.gz /outputs/
After running this (with DOCKER_BUILDKIT=1 sudo docker build --output . .) and inspecting the contents of the output .tar.gz, I find that it contains bin/python, bin/python3, and bin/python3.9 binaries.
However, none of these binaries have the required packages:
~/dev/cc/emr-serverless-samples/examples/pyspark/dependencies/outputs/bin$ ./python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import idna
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'idna'
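One guess about the local symptom (not verified): a venv's bin/python is normally a symlink back to the interpreter it was created from, so when the archive is extracted on a different machine it resolves to the host's Python (3.10 here) rather than the image's 3.9, and the packed site-packages never get picked up. From the extracted archive root, something like this shows where the interpreter actually points:
ls -l bin/python*        # are these symlinks, and to what?
readlink -f bin/python   # on the host this likely resolves to the system Python
./bin/python -V          # 3.10.12 would be the host interpreter, not the image's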
docker build indicates that the packages have been installed:
[+] Building 144.7s (11/11) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 601B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for public.ecr.aws/amazonlinux/amazonlinux:2023-minimal 1.5s
=> [base 1/6] FROM public.ecr.aws/amazonlinux/amazonlinux:2023-minimal@sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb 4.1s
=> => resolve public.ecr.aws/amazonlinux/amazonlinux:2023-minimal@sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb 0.0s
=> => sha256:e831127e1f042c58b74da069439cf1452efa4314ca19c69b8e376186aabcb714 35.06MB / 35.06MB 2.0s
=> => sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb 770B / 770B 0.0s
=> => sha256:fd9eb74c5472b7e4286b3ae4b3649b2c7eb8968684e3a8d9158241417ca813be 529B / 529B 0.0s
=> => sha256:40c4449cff5bdec9bf82d3929159e57174488d711a8a9350790b24b3cc0104f3 1.48kB / 1.48kB 0.0s
=> => extracting sha256:e831127e1f042c58b74da069439cf1452efa4314ca19c69b8e376186aabcb714 1.9s
=> [base 2/6] RUN dnf install -y gcc python3 python3-devel 37.4s
=> [base 3/6] RUN python3 -m venv /opt/venv 3.6s
=> [base 4/6] RUN python3 -m pip install --upgrade pip && python3 -m pip install great_expectations==0.18.7 venv-pack==0.2.0 66.5s
=> [base 5/6] RUN python3 -m pip install botocore boto3 requests warcio idna 7.8s
=> [base 6/6] RUN mkdir /output && venv-pack -o /output/pyspark_venv.tar.gz 21.2s
=> [export 1/1] COPY --from=base /output/pyspark_venv.tar.gz /outputs/ 1.2s
=> exporting to client directory 0.8s
=> => copying files 169.91MB
Any idea what the problem is? If I attempt to use the output archive in the intended fashion on EMR studio, I just get a ModuleNotFoundError.
@PeterCarragher A few questions:
- Are you trying to get idna installed to use in EMR Studio (interactive) or as part of your batch jobs?
- What are you doing with the tar.gz? Are you providing options to EMR to make use of it?
For reference, the AL2023 image is only compatible with EMR 7.x. If you're using it with EMR Studio, you use a customized image based off the EMR Serverless image. If you're using it with batch jobs, you need to provide the proper sparkSubmitParameters to copy/enable the virtualenv.
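For batch jobs, the relevant options look roughly like this (a sketch; the bucket and archive names are placeholders, and the same pattern appears in the job configurations later in this thread):
--conf spark.archives=s3://<your-bucket>/pyspark_venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python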
Let me know, happy to help and maybe even put together a little video. :)
@dacort thank you for the quick reply! TL;DR: using the correct image fixes the problem.
I am trying to get up and running on EMR Studio.
Application settings:
Job settings (passing the .tar.gz):
--conf spark.submit.pyFiles=s3://cc-pyspark/sparkcc.py
--conf spark.archives=s3://cc-pyspark/pyspark_venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python
stdout
Jan 16, 2024 12:42:06 AM org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption
WARNING: spark.log4jHotPatch.enabled is set to true, but /usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not exist at the configured location
Files s3://cc-pyspark/pyspark_venv.tar.gz#environment from /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/pyspark_venv.tar.gz to /home/hadoop/environment
Files s3://cc-pyspark/sparkcc.py from /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/sparkcc.py to /home/hadoop/sparkcc.py
24/01/16 00:42:11 INFO ShutdownHookManager: Shutdown hook called
24/01/16 00:42:11 INFO ShutdownHookManager: Deleting directory /tmp/localPyFiles-3bbd171b-bb9c-42cc-aad9-66ff2dae7065
24/01/16 00:42:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca
stderr
Traceback (most recent call last):
File "/tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/wat_extract_links.py", line 1, in <module>
import idna
ModuleNotFoundError: No module named 'idna'
After getting this error I figured the issue was in the step where I set up the archive as described in the readme. So I debugged locally to see if the python binaries in the archive had these modules; I got the same import errors locally.
Following the URL you shared, I tried updating to an image that matches the EMR studio version I'm using:
FROM --platform=linux/x86_64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest AS base
USER root
RUN yum install -y gcc python3 python3-devel
Re-uploading to S3 and testing the EMR Spark job, it runs now. Thanks for the help!
However, testing the imports locally still fails. For future reference, is there a way to test the Python binaries locally?
Hm, there's something odd going on here.
It looks like idna is included by default in Python 3.9 (which is used in EMR 7.0.0). So it should work locally without even doing anything with the image. For example:
❯ docker run --rm -it --entrypoint /bin/bash public.ecr.aws/emr-serverless/spark/emr-7.0.0
bash-5.2$ python3 -c "import idna; print(idna.decode('xn--eckwd4c7c.xn--zckzah'))"
ドメイン.テスト
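On testing the binaries locally: the simplest check I know of is to unpack the archive inside the same EMR image, so it runs against the matching interpreter, roughly like this (paths are assumptions):
# Mount the directory containing the archive into the EMR 7.0.0 image and open a shell
docker run --rm -it -v "$PWD:/work" --entrypoint /bin/bash public.ecr.aws/emr-serverless/spark/emr-7.0.0
# Inside the container: unpack the venv and import with its own interpreter
mkdir /tmp/environment && tar xzf /work/pyspark_venv.tar.gz -C /tmp/environment
/tmp/environment/bin/python -c "import idna; print(idna.decode('xn--eckwd4c7c.xn--zckzah'))"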
Hi @dacort,
For this environment:
I'm experiencing the same issue with Python 3.11.6, leading to ModuleNotFoundError for my Python packages (worth noting, it does load the file passed in spark.submit.pyFiles='...'):
StdOut:
Traceback (most recent call last):
File "/tmp/spark-c646f03c-8be8-404b-821a-be175c437ce6/emr_cli.py", line 4, in <module>
from my_app.etl_entrypoint import ETLEntrypoint
File "<frozen zipimport>", line 259, in load_module
File "/tmp/spark-c646f03c-8be8-404b-821a-be175c437ce6/my_app.zip/my_app/etl_entrypoint.py", line 4, in <module>
ModuleNotFoundError: No module named 'dependency_injector'
StdErr:
Oct 17, 2024 1:07:43 AM org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption
WARNING: spark.log4jHotPatch.enabled is set to true, but /usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not exist at the configured location
Files s3://my-bucket/pyspark_deps.tar.gz#environment from /tmp/spark-a01defc2-a54b-47ff-9087-55e93a1f929d/pyspark_deps.tar.gz to /home/hadoop/environment
Files s3:/my-bucket/src.zip from /tmp/spark-a01defc2-a54b-47ff-9087-55e93a1f929d/my_app.zip to /home/hadoop/my_app.zip
24/10/17 01:07:51 INFO ShutdownHookManager: Shutdown hook called
24/10/17 01:07:51 INFO ShutdownHookManager: Deleting directory /tmp/spark-a01defc2-a54b-47ff-9087-55e93a1f929d
This is my Dockerfile (I've tried it with other images as well, and different ways of installing the Python libs):
FROM --platform=linux/amd64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest AS base
USER root
# Install Python 3.11 and other necessary tools
RUN dnf install -y python3.11 python3.11-pip tar gzip && \
dnf clean all && \
alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
alternatives --set python3 /usr/bin/python3.11 && \
ln -sf /usr/bin/python3.11 /usr/local/bin/python
ENV PYSPARK_PYTHON=/usr/bin/python3.11
# Create the virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN source $VIRTUAL_ENV/bin/activate
# Upgrade pip and install Poetry and venv-pack (I listed all deps one by one to make sure the package manager is not missing something from requirements.txt)
ENV POETRY_VERSION=1.8.1
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install poetry==$POETRY_VERSION venv-pack && \
python3 -m pip install dependency-injector==4.42.0 && \
python3 -m pip install loguru==0.7.2 && \
python3 -m pip install pydantic-core==2.23.4 && \
python3 -m pip install pydantic==2.9.2 && \
python3 -m pip install pyyaml==6.0.2 && \
python3 -m pip install six==1.16.0 && \
python3 -m pip install sqlalchemy==2.0.36 && \
python3 -m pip install typing-extensions==4.12.2 && \
python3 -m pip install annotated-types==0.7.0
ENV PATH="$PATH:/root/.local/bin"
WORKDIR /app
COPY . .
# Install project dependencies using Poetry
#RUN poetry config virtualenvs.create false && \
# poetry install --no-root --no-dev
RUN python3 -m pip install -r requirements.txt
# Package the virtual environment using venv-pack
RUN mkdir -p dist && \
venv-pack -o dist/pyspark_deps.tar.gz
USER hadoop:hadoop
# Export the packaged virtual environment
FROM scratch
COPY --from=base /app/dist/pyspark_deps.tar.gz /
I'm triggering the job with the parameters below:
aws emr-serverless start-job-run \
--application-id $APPLICATION_ID \
--name "MyETLJob" \
--execution-role-arn $IAM_ROLE \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://my-bucket/emr_cli.py",
"sparkSubmitParameters": "--conf spark.archives=s3://my-bucket/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.submit.pyFiles=s3://my-bucket/my_app.zip"
}
}' \
--configuration-overrides '{
"monitoringConfiguration": {
"s3MonitoringConfiguration": {
"logUri": "s3://my-bucket/logs/"
}
}
}'
I can also confirm the module is in the bundle of Python packages. Could the issue be related to my Python version conflicting with the internal Python version?
Based on the ModuleNotFoundError: No module named 'dependency_injector' error, is dependency_injector part of your package? If so, it might be that your package isn't getting put into the pyspark_deps.tar.gz archive. I'd check there first. And sorry, I'm not with AWS anymore so can't provide much more insight.
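If it helps, a quick way to check is listing the archive contents before uploading (filename taken from the Dockerfile above):
tar tzf pyspark_deps.tar.gz | grep -E 'dependency_injector|my_app'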
Hi,
I have run the example from https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies
aws emr-serverless start-job-run \
  --application-id $APPLICATION_ID \
  --execution-role-arn $JOB_ROLE_ARN \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/ge_profile.py",
      "entryPointArguments": ["s3://'${S3_BUCKET}'/tmp/ge-profile"],
      "sparkSubmitParameters": "--conf spark.archives=s3://'${S3_BUCKET}'/artifacts/pyspark/pyspark_ge.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://'${S3_BUCKET}'/logs/"
      }
    }
  }'
And I keep getting:
Traceback (most recent call last):
  File "/tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/ge_profile.py", line 4, in <module>
    import great_expectations as ge
ModuleNotFoundError: No module named 'great_expectations'
I can confirm that the module is included in pyspark_ge.tar.gz.
Thanks for the help
Eric