Hi Eric, thanks for opening the issue - let me double-check this example to verify it still works.
@waltari2001 I just ran through the example step-by-step and it works for me. Some things for you to check:
- Did pyspark_ge.tar.gz get bundled properly? e.g. if you run tar tzvf pyspark_ge.tar.gz | grep great_expectations, are there entries in the tar file?
- In your stderr file, you should see lines like this:
24/01/08 17:43:52 INFO SparkContext: Added archive file:/tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz#environment at spark://[2600:1f14:2e15:a301:cd15:538d:7b16:decc]:46003/files/pyspark_ge.tar.gz with timestamp 1704735831138
24/01/08 17:43:52 INFO Utils: Copying /tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz to /tmp/spark-84fd23a2-89bc-4573-a253-1ee2ecc61486/pyspark_ge.tar.gz
24/01/08 17:43:52 INFO SparkContext: Unpacking an archive file:/tmp/spark-3436775b-fa40-43bb-b73e-d3b55b987773/pyspark_ge.tar.gz#environment from /tmp/spark-84fd23a2-89bc-4573-a253-1ee2ecc61486/pyspark_ge.tar.gz to /tmp/spark-69cc27ba-e1dc-4691-a6ce-bc504186703d/userFiles-85021b77-6d66-4148-858b-f34d9ed35489/environment
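If it helps, one way to pull the driver stderr from the S3 log location and look for those lines (a rough sketch; the bucket, application ID, and job run ID are placeholders, and the path assumes the standard s3MonitoringConfiguration layout):
aws s3 cp "s3://<log-bucket>/logs/applications/<application-id>/jobs/<job-run-id>/SPARK_DRIVER/stderr.gz" - \
  | gunzip -c \
  | grep -E 'Added archive|Unpacking an archive'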
stderr:
Files s3://splunk-config-test/artifacts/pyspark/pyspark_ge.tar.gz#environment from /tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/pyspark_ge.tar.gz to /home/hadoop/environment 24/01/08 15:30:44 INFO ShutdownHookManager: Shutdown hook called 24/01/08 15:30:44 INFO ShutdownHookManager: Deleting directory /tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0
stdout:
Traceback (most recent call last):
File "/tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/ge_profile.py", line 4, in
I am also using emr-7.0.0 with this test.
Ah interesting...just tested 7.0.0 now and found that it doesn't work. I was on 6.14.0.
Investigating more. It could be because EMR 7 is using Amazon Linux 2023...may need to swap out the Docker base image.
OK, the image definitely needs to be updated to al2023. That said, while I can submit the job and it starts running, it just hangs and the executors die. I'm unsure if this is a Great Expectations compatibility issue with Spark 3.5(?) or something else... I tried updating to the latest version of GE, but I'm still experiencing the issue.
edit: I think it's user error - I didn't configure the EMR Serverless application with networking, and it's not in the same region as the source data.
Let me know if that's relevant or if you were just trying to get dependencies working in general.
This is the Dockerfile I used:
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base
RUN dnf install -y gcc python3 python3-devel
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install \
great_expectations==0.18.7 \
venv-pack==0.2.0
RUN mkdir /output && venv-pack -o /output/pyspark_ge.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_ge.tar.gz /
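For reference, a quick sketch of how to build this and sanity-check the resulting archive (same commands used elsewhere in this thread; --output writes the exported file to the current directory):
# Build the image and export the packed venv to the current directory
DOCKER_BUILDKIT=1 docker build --output . .
# Confirm great_expectations is present in the archive
tar tzf pyspark_ge.tar.gz | grep great_expectations | head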
Yup, needed to configure the application properly.
This works now with AL2023 as the base image on EMR 7.x.
I'll leave this open until I update the examples.
Hi @dacort, I'm trying to use this AL2023 Dockerfile as a blueprint for creating an environment that has certain packages installed (having run into the same ModuleNotFoundErrors using the original Dockerfile in the repo). Specifically:
FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal AS base
RUN dnf install -y gcc python3 python3-devel
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install \
botocore boto3 requests warcio idna \
venv-pack==0.2.0
RUN mkdir /output && venv-pack -o /output/pyspark_venv.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_venv.tar.gz /outputs/
After running this (with DOCKER_BUILDKIT=1 sudo docker build --output . .) and inspecting the contents of the output .tar.gz, I find that it contains bin/python, bin/python3, and bin/python3.9 binaries.
However, none of these binaries have the required packages:
~/dev/cc/emr-serverless-samples/examples/pyspark/dependencies/outputs/bin$ ./python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import idna
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'idna'
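One guess about the local symptom (not verified): a venv's bin/python is normally a symlink back to the interpreter it was created from, so when the archive is extracted on a different machine it resolves to the host's Python (3.10 here) rather than the image's 3.9, and the packed site-packages never get picked up. From the extracted archive root, something like this shows where the interpreter actually points:
ls -l bin/python*        # are these symlinks, and to what?
readlink -f bin/python   # on the host this likely resolves to the system Python
./bin/python -V          # 3.10.12 would be the host interpreter, not the image's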
docker build indicates that the packages have been installed:
[+] Building 144.7s (11/11) FINISHED docker:default
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 601B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [internal] load metadata for public.ecr.aws/amazonlinux/amazonlinux:2023-minimal 1.5s
=> [base 1/6] FROM public.ecr.aws/amazonlinux/amazonlinux:2023-minimal@sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb 4.1s
=> => resolve public.ecr.aws/amazonlinux/amazonlinux:2023-minimal@sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb 0.0s
=> => sha256:e831127e1f042c58b74da069439cf1452efa4314ca19c69b8e376186aabcb714 35.06MB / 35.06MB 2.0s
=> => sha256:88fe8c5fd82cee2e8a9cbbeef4fd41ed1fe840ff2163912d7698948cecf91edb 770B / 770B 0.0s
=> => sha256:fd9eb74c5472b7e4286b3ae4b3649b2c7eb8968684e3a8d9158241417ca813be 529B / 529B 0.0s
=> => sha256:40c4449cff5bdec9bf82d3929159e57174488d711a8a9350790b24b3cc0104f3 1.48kB / 1.48kB 0.0s
=> => extracting sha256:e831127e1f042c58b74da069439cf1452efa4314ca19c69b8e376186aabcb714 1.9s
=> [base 2/6] RUN dnf install -y gcc python3 python3-devel 37.4s
=> [base 3/6] RUN python3 -m venv /opt/venv 3.6s
=> [base 4/6] RUN python3 -m pip install --upgrade pip && python3 -m pip install great_expectations==0.18.7 venv-pack==0.2.0 66.5s
=> [base 5/6] RUN python3 -m pip install botocore boto3 requests warcio idna 7.8s
=> [base 6/6] RUN mkdir /output && venv-pack -o /output/pyspark_venv.tar.gz 21.2s
=> [export 1/1] COPY --from=base /output/pyspark_venv.tar.gz /outputs/ 1.2s
=> exporting to client directory 0.8s
=> => copying files 169.91MB
Any idea what the problem is? If I attempt to use the output archive in the intended fashion on EMR studio, I just get a ModuleNotFoundError.
@PeterCarragher A few questions:
- Are you trying to get idna installed to use in EMR Studio (interactive) or as part of your batch jobs?
- What are you doing with the tar.gz? Are you providing options to EMR to make use of it?
For reference, the AL2023 image is only compatible with EMR 7.x. If you're using it with EMR Studio, you use a customized image based off the EMR Serverless image. If you're using it with batch jobs, you need to provide the proper sparkSubmitParameters to copy/enable the virtualenv.
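For batch jobs, the relevant options look roughly like this (a sketch; the bucket and archive names are placeholders, and the same pattern appears in the job configurations later in this thread):
--conf spark.archives=s3://<your-bucket>/pyspark_venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python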
Let me know, happy to help and maybe even put together a little video. :)
@dacort thank you for the quick reply! TL;DR: using the correct image fixes the problem.
I am trying to get up and running on EMR Studio.
Application settings:
Job settings (passing the .tar.gz):
--conf spark.submit.pyFiles=s3://cc-pyspark/sparkcc.py
--conf spark.archives=s3://cc-pyspark/pyspark_venv.tar.gz#environment
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python
stdout
Jan 16, 2024 12:42:06 AM org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption
WARNING: spark.log4jHotPatch.enabled is set to true, but /usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not exist at the configured location
Files s3://cc-pyspark/pyspark_venv.tar.gz#environment from /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/pyspark_venv.tar.gz to /home/hadoop/environment
Files s3://cc-pyspark/sparkcc.py from /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/sparkcc.py to /home/hadoop/sparkcc.py
24/01/16 00:42:11 INFO ShutdownHookManager: Shutdown hook called
24/01/16 00:42:11 INFO ShutdownHookManager: Deleting directory /tmp/localPyFiles-3bbd171b-bb9c-42cc-aad9-66ff2dae7065
24/01/16 00:42:11 INFO ShutdownHookManager: Deleting directory /tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca
stderr
Traceback (most recent call last):
File "/tmp/spark-e8b83dd4-dfa8-45bb-ba57-b7f2a7a918ca/wat_extract_links.py", line 1, in <module>
import idna
ModuleNotFoundError: No module named 'idna'
After getting this error I figured the issue was in the step where I set up the archive as described in the readme. So I debugged locally to see if the python binaries in the archive had these modules; I got the same import errors locally.
Following the URL you shared, I tried updating to an image that matches the EMR studio version I'm using:
FROM --platform=linux/x86_64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest AS base
USER root
RUN yum install -y gcc python3 python3-devel
Re-uploading to S3 and testing the EMR Spark job, it runs now. Thanks for the help!
However, testing the imports locally still fails. For future reference, is there a way to test the Python binaries locally?
Hm, there's something odd going on here.
It looks like idna is included by default in Python 3.9 (which is used in EMR 7.0.0). So it should work locally without even doing anything with the image. For example:
❯ docker run --rm -it --entrypoint /bin/bash public.ecr.aws/emr-serverless/spark/emr-7.0.0
bash-5.2$ python3 -c "import idna; print(idna.decode('xn--eckwd4c7c.xn--zckzah'))"
ドメイン.テスト
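On testing the binaries locally: the simplest check I know of is to unpack the archive inside the same EMR image, so it runs against the matching interpreter, roughly like this (paths are assumptions):
# Mount the directory containing the archive into the EMR 7.0.0 image and open a shell
docker run --rm -it -v "$PWD:/work" --entrypoint /bin/bash public.ecr.aws/emr-serverless/spark/emr-7.0.0
# Inside the container: unpack the venv and import with its own interpreter
mkdir /tmp/environment && tar xzf /work/pyspark_venv.tar.gz -C /tmp/environment
/tmp/environment/bin/python -c "import idna; print(idna.decode('xn--eckwd4c7c.xn--zckzah'))"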
Hi @dacort,
For this environment:
I'm experiencing the same issue with Python 3.11.6, leading to ModuleNotFoundError for my Python packages (worth noting, it does load the file passed in spark.submit.pyFiles='...'):
StdOut:
Traceback (most recent call last):
File "/tmp/spark-c646f03c-8be8-404b-821a-be175c437ce6/emr_cli.py", line 4, in <module>
from my_app.etl_entrypoint import ETLEntrypoint
File "<frozen zipimport>", line 259, in load_module
File "/tmp/spark-c646f03c-8be8-404b-821a-be175c437ce6/my_app.zip/my_app/etl_entrypoint.py", line 4, in <module>
ModuleNotFoundError: No module named 'dependency_injector'
StdErr:
Oct 17, 2024 1:07:43 AM org.apache.spark.launcher.Log4jHotPatchOption staticJavaAgentOption
WARNING: spark.log4jHotPatch.enabled is set to true, but /usr/share/log4j-cve-2021-44228-hotpatch/jdk17/Log4jHotPatchFat.jar does not exist at the configured location
Files s3://my-bucket/pyspark_deps.tar.gz#environment from /tmp/spark-a01defc2-a54b-47ff-9087-55e93a1f929d/pyspark_deps.tar.gz to /home/hadoop/environment
Files s3:/my-bucket/src.zip from /tmp/spark-a01defc2-a54b-47ff-9087-55e93a1f929d/my_app.zip to /home/hadoop/my_app.zip
24/10/17 01:07:51 INFO ShutdownHookManager: Shutdown hook called
24/10/17 01:07:51 INFO ShutdownHookManager: Deleting directory /tmp/spark-a01defc2-a54b-47ff-9087-55e93a1f929d
This is my Dockerfile (I've tried it with other images as well, and different ways of installing the Python libs):
FROM --platform=linux/amd64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest AS base
USER root
# Install Python 3.11 and other necessary tools
RUN dnf install -y python3.11 python3.11-pip tar gzip && \
dnf clean all && \
alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
alternatives --set python3 /usr/bin/python3.11 && \
ln -sf /usr/bin/python3.11 /usr/local/bin/python
ENV PYSPARK_PYTHON=/usr/bin/python3.11
# Create the virtual environment
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN source $VIRTUAL_ENV/bin/activate
# Upgrade pip and install Poetry and venv-pack (I listed all deps one by one to make sure the package manager is not missing something from requirements.txt)
ENV POETRY_VERSION=1.8.1
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install poetry==$POETRY_VERSION venv-pack && \
python3 -m pip install dependency-injector==4.42.0 && \
python3 -m pip install loguru==0.7.2 && \
python3 -m pip install pydantic-core==2.23.4 && \
python3 -m pip install pydantic==2.9.2 && \
python3 -m pip install pyyaml==6.0.2 && \
python3 -m pip install six==1.16.0 && \
python3 -m pip install sqlalchemy==2.0.36 && \
python3 -m pip install typing-extensions==4.12.2 && \
python3 -m pip install annotated-types==0.7.0
ENV PATH="$PATH:/root/.local/bin"
WORKDIR /app
COPY . .
# Install project dependencies using Poetry
#RUN poetry config virtualenvs.create false && \
# poetry install --no-root --no-dev
RUN python3 -m pip install -r requirements.txt
# Package the virtual environment using venv-pack
RUN mkdir -p dist && \
venv-pack -o dist/pyspark_deps.tar.gz
USER hadoop:hadoop
# Export the packaged virtual environment
FROM scratch
COPY --from=base /app/dist/pyspark_deps.tar.gz /
I'm triggering the job with the parameters below:
aws emr-serverless start-job-run \
--application-id $APPLICATION_ID \
--name "MyETLJob" \
--execution-role-arn $IAM_ROLE \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://my-bucket/emr_cli.py",
"sparkSubmitParameters": "--conf spark.archives=s3://my-bucket/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.submit.pyFiles=s3://my-bucket/my_app.zip"
}
}' \
--configuration-overrides '{
"monitoringConfiguration": {
"s3MonitoringConfiguration": {
"logUri": "s3://my-bucket/logs/"
}
}
}'
I can also confirm the module is in the bundle of Python packages. Could the issue be related to my Python version conflicting with the internal Python version?
Based on the ModuleNotFoundError: No module named 'dependency_injector' error, is dependency_injector part of your package? If so, it might be that your package isn't getting put into the pyspark_deps.tar.gz archive. I'd check there first. And sorry, I'm not with AWS anymore so can't provide much more insight.
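If it helps, a quick way to check is listing the archive contents before uploading (filename taken from the Dockerfile above):
tar tzf pyspark_deps.tar.gz | grep -E 'dependency_injector|my_app'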
Hi,
I have run the example from https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies
aws emr-serverless start-job-run \
  --application-id $APPLICATION_ID \
  --execution-role-arn $JOB_ROLE_ARN \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/ge_profile.py",
      "entryPointArguments": ["s3://'${S3_BUCKET}'/tmp/ge-profile"],
      "sparkSubmitParameters": "--conf spark.archives=s3://'${S3_BUCKET}'/artifacts/pyspark/pyspark_ge.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "s3MonitoringConfiguration": {
        "logUri": "s3://'${S3_BUCKET}'/logs/"
      }
    }
  }'
And I keep getting:
Traceback (most recent call last):
  File "/tmp/spark-adc7d82f-ca5d-41ca-8d69-a89f144589a0/ge_profile.py", line 4, in <module>
    import great_expectations as ge
ModuleNotFoundError: No module named 'great_expectations'
I can confirm that the module is included in pyspark_ge.tar.gz.
Thanks for the help
Eric