aws-samples / emr-serverless-samples

Example code for running Spark and Hive jobs on EMR Serverless.
https://aws.amazon.com/emr/serverless/
MIT No Attribution

The suggested way of using Python libraries with EMR Serverless does not work #16

Closed amin010 closed 2 years ago

amin010 commented 2 years ago

As detailed here and here, we should be able to install and use a Python venv:

--conf spark.archives=s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment 
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python 
--conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python 
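
For context, the documented flow builds and uploads that archive roughly like this (the package list here is only illustrative):

python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install venv-pack boto3 pandas      # venv-pack plus whatever the job needs
venv-pack -f -o pyspark_venv.tar.gz
aws s3 cp pyspark_venv.tar.gz s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/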

but that doesn't seem to work. The application fails with this error:

Unpacking an archive s3://DOC-EXAMPLE-BUCKET/EXAMPLE-PREFIX/pyspark_venv.tar.gz#environment from /tmp/spark-02908b0e-9b64-469d-xxx-xxxxxxxx/pyspark_venv.tar.gz to /home/hadoop/./environment
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:105)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1003)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1092)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1101)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=2, No such file or directory
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 14 more
22/07/20 05:36:18 INFO ShutdownHookManager: Shutdown hook called
22/07/20 05:36:18 INFO ShutdownHookManager: Deleting directory /tmp/spark-02908b0e-9b64-469d-b094-edee291a2426
dacort commented 2 years ago

Hi @amin010 - how did you build your venv? We have a note in the docs regarding this:

You must run the following commands in a similar Amazon Linux 2 environment with the same version of Python as you use in EMR Serverless, that is, Python 3.7.10 for EMR 6.6.0.

There's a sample Dockerfile in this repo that uses amazonlinux:2 as the base container image. Let me know!
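
As a rough sketch (not the repo's exact Dockerfile), the build can also be run directly inside an amazonlinux:2 container so the interpreter matches the EMR 6.6.0 runtime; the package list is illustrative:

docker run --rm -v "$PWD":/output amazonlinux:2 /bin/bash -c '
  yum install -y -q python3 python3-pip
  python3 -m venv /tmp/venv && source /tmp/venv/bin/activate
  pip install venv-pack boto3           # boto3 is just an example dependency
  venv-pack -f -o /output/pyspark_venv.tar.gz
'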

amin010 commented 2 years ago

I think you're right, I just used Python 3.7.10 and not Amazon Linux 2. The error message "No such file or directory" threw me off. Thanks.

dacort commented 2 years ago

Thanks for following up. I do think we need to make it more clear in the docs - I've had other folks run into similar issues.

StevenTollanis commented 1 year ago

Same error as @amin010 (not sure if EMR unpacks the tar.gz into the correct folder), so I packed it as a .zip instead, and now I have

Cannot run program "./environment/bin/python3.9": error=13, Permission denied

I already followed the Docker steps using amazonlinux:2. Thanks

dacort commented 1 year ago

@StevenTollanis Based on your error message, it looks like you're using a different version of Python (3.9 vs 3.7). Can you provide the Dockerfile you used?

EMR Serverless can use a different version of Python, but it requires some additional steps.

I also verified that the python3 version used in the amazonlinux:2 image is still 3.7.
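
A quick way to double-check that locally (installing python3 first, since the base image may not ship it):

docker run --rm amazonlinux:2 bash -c 'yum install -y -q python3 >/dev/null; python3 --version'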

StevenTollanis commented 1 year ago

@dacort This is my Dockerfile

FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install gcc openssl-devel bzip2-devel libffi-devel gzip make -y
RUN yum install wget tar -y
WORKDIR /opt
RUN wget https://www.python.org/ftp/python/3.9.6/Python-3.9.6.tgz
RUN tar xzf Python-3.9.6.tgz
WORKDIR /opt/Python-3.9.6
RUN ./configure --enable-optimizations
RUN make altinstall
RUN rm -f /opt/Python-3.9.6.tgz

ENV VIRTUAL_ENV=/opt/venv
RUN python3.9 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3.9 -m pip install --upgrade pip && \
    python3.9 -m pip install \
    boto3 pandas pyspark \
    venv-pack==0.2.0

RUN mkdir /output && venv-pack -o /output/emr-python.tar.gz

FROM scratch AS export
COPY --from=base /output/emr-python.tar.gz /
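
For reference, a multi-stage Dockerfile with a FROM scratch AS export stage like this one only produces the archive when built with BuildKit's output flag, roughly:

DOCKER_BUILDKIT=1 docker build --output ./output .
ls ./output/emr-python.tar.gz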

StevenTollanis commented 1 year ago

@dacort OK, gonna check the "requires some additional steps" part and try again, first using tar.gz. Thanks

dacort commented 1 year ago

@StevenTollanis Cool. A couple of specific things to call out:

StevenTollanis commented 1 year ago

@dacort I rebuilt my Dockerfile based on your custom Dockerfile, but it went back to the error I got the first time: Cannot run program "./environment/bin/python": error=13, Permission denied

The weird thing is that the only packed format it could read was still .zip. I think I'm going to try with the EMR base Python version 3.7 to clear any doubts about compatibility. Did these custom examples work for you?

Note the sparkSubmitParameters when you run the job

Yes, I cleaned those following the AWS docs at https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html
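
Per those docs, the full job submission looks roughly like this (application ID, role ARN, and script path are placeholders):

aws emr-serverless start-job-run \
    --application-id <application-id> \
    --execution-role-arn <job-role-arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://test-bucket/scripts/job.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://test-bucket/emr-python.zip#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'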

StevenTollanis commented 1 year ago

@dacort Unbelievable: still Cannot run program "./environment/bin/python": error=13, Permission denied, but now using the default Python 3.7 version, as you mentioned in Python Dependencies. BTW, I also tested your custom Dockerfile as is, but got the same error.

Here my log if any help:

Unpacking an archive s3://test-bucket/emr-python.zip#environment from /tmp/spark-big-number/emr-python.zip to /home/hadoop/./environment
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=13, Permission denied
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:105)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1006)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1095)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1104)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 14 more
22/12/27 16:38:55 INFO ShutdownHookManager: Shutdown hook called
22/12/27 16:38:55 INFO ShutdownHookManager: Deleting directory /tmp/spark-big-number

PS: @amin010 I wonder if you solved this issue using EMR Serverless, or if you switched to another tool such as Glue, or went the cluster route to execute these jobs. Thanks

amin010 commented 1 year ago

PS: @amin010 I wonder if you solved this issue using EMR Serverless, or if you switched to another tool such as Glue, or went the cluster route to execute these jobs. Thanks

Yes, as mentioned in the thread, my issue got resolved. That being said, I haven't used it since then, so I'm not aware of any possible new issues.

dacort commented 1 year ago

also tested your custom Dockerfile as is, but the same error.

I re-tested the instructions given there with EMR release versions 6.6.0 and 6.9.0 and they seemed to work fine. The permission denied error suggests the Python executable might not be executable by users other than root? Going to try to test the Dockerfile you provided as well.

If you can provide step-by-step instructions that lead to the failure, that could help, but so far I can't tell where things might be going wrong.
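
In the meantime, one quick check is to look at the mode bits on the packed interpreter before uploading, e.g.:

tar tzvf emr-python.tar.gz | grep 'bin/python'     # expect -rwxr-xr-x on bin/python*
zipinfo emr-python.zip | grep 'bin/python'         # same idea for the .zip variant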

StevenTollanis commented 1 year ago

@dacort ...hold on, I'm seeing that SageMaker can handle my dependencies, which is really useful. Any thoughts?

dacort commented 1 year ago

@StevenTollanis That's a tough question to answer in this medium. I suppose the answer will be influenced by what you are trying to accomplish at the end of the day. SageMaker is an end-to-end machine learning solution - so if you're building/training/deploying ML models to the cloud, it's a good approach. If you have pre-existing Spark jobs and want to run them on EMR, Serverless is a great way to start. If you need an end-to-end data integration platform, Glue is a good place to look.

Sorry you're having issues getting going with Serverless. For a simple dependency like pandas, though, the instructions in the dependencies section should work fine. I would start there again, replace great_expectations with pandas in the Dockerfile and try to run a sample pandas job.

StevenTollanis commented 1 year ago

@StevenTollanis That's a tough question to answer in this medium. I suppose the answer will be influenced by what you are trying to accomplish at the end of the day. SageMaker is an end-to-end machine learning solution - so if you're building/training/deploying ML models to the cloud, it's a good approach. If you have pre-existing Spark jobs and want to run them on EMR, Serverless is a great way to start. If you need an end-to-end data integration platform, Glue is a good place to look.

Sorry you're having issues getting going with Serverless. For a simple dependency like pandas, though, the instructions in the dependencies section should work fine. I would start there again, replace great_expectations with pandas in the Dockerfile and try to run a sample pandas job.

Thanks, I think I did start again with your clean Dockerfile, but it threw the same error=13, Permission denied error; anyway, I'll try one more time. The test I did using SageMaker Processing jobs went very smoothly (no dependency building). What do you think about using Step Functions with Lambdas to do the same ETL/Spark job, any scalability issues? Happy New Year

StevenTollanis commented 1 year ago

First, I thought the issue was that I used a different image (I used linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2 because of the rate limit on linux/amd64 amazonlinux:2 in CodeBuild), so then I built it locally using linux/amd64 amazonlinux:2 and just added pandas to your custom Dockerfile ... but again, Cannot run program "./environment/bin/python": error=13, Permission denied. I also checked permissions and did chmod a+rx on the zip, but got the same error.

StevenTollanis commented 1 year ago

BTW, I just rechecked the AWS user guide for using Python libraries and found that it's missing the property spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python, which your samples have. I also tested this on my script, but still get Cannot run program "./environment/bin/python": error=13, Permission denied

dacort commented 1 year ago

@StevenTollanis I'm sorry, I'm not sure why it's not working for you. I've run the steps manually and it works fine for me. If you want to produce a detailed set of steps/artifacts including your Dockerfile and AWS CLI commands that cause this issue for you, I can help debug, but at the moment I'm not able to reproduce the issue.

I've created a simple script that uses the custom Dockerfile, adds pandas, and runs a successful job as an example: https://gist.github.com/dacort/9ad6f5712fca1c08618582f6674e4303

StevenTollanis commented 1 year ago

@StevenTollanis I'm sorry, I'm not sure why it's not working for you. I've run the steps manually and it works fine for me. If you want to produce a detailed set of steps/artifacts including your Dockerfile and AWS CLI commands that cause this issue for you, I can help debug, but at the moment I'm not able to reproduce the issue.

I've created a simple script that uses the custom Dockerfile, adds pandas, and runs a successful job as an example: https://gist.github.com/dacort/9ad6f5712fca1c08618582f6674e4303

BTW, I created the EMR Serverless app and ran the jobs using the console, not by sending the commands via the CLI. That should work either way, right?

dacort commented 1 year ago

Should work either way, yes.

StevenTollanis commented 1 year ago

@dacort As you mentioned, "The permission denied error seems like the python executable might not be executable by users other than root?" Your script from the gist helped me find out that Docker created the tar/zip as root. I had to make changes in my Linux VM so that Docker creates the artifact as the current machine user; I guess it's the same process for a build pipeline. Thank you very much for your help.
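
For anyone hitting the same thing, common workarounds look roughly like this (illustrative, not necessarily the exact change I made):

# Option 1: re-own the BuildKit output after the build
DOCKER_BUILDKIT=1 docker build --output ./output .
sudo chown "$(id -u):$(id -g)" ./output/emr-python.tar.gz

# Option 2: run Docker in rootless mode so build outputs are created as your user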

PS: I'm probably going to use one of the other services mentioned above. EMR still doesn't have the "developer friendliness" that Lambda has for managing dependencies, or even an included IDE. Hopefully we'll get that soon.

dacort commented 1 year ago

@StevenTollanis Interesting. And you were able to successfully run the job after you made the changes? The Python executable that's in the tar.gz (when I build the image on macOS locally) should still be user-executable, but I'll have to give it a shot on Linux as well.

tar tzvf pyspark_3.10.6.tar.gz | grep bin/python
-rwxr-xr-x  0 root   root 19801640 Dec 27 10:28 bin/python3.10
-rwxr-xr-x  0 root   root 19801640 Dec 27 10:28 bin/python
-rwxr-xr-x  0 root   root 19801640 Dec 27 10:28 bin/python3

Thanks so much for continuing to dig in on this. We're definitely working on improving the developer experience. We've released a developer preview of a VS Code extension ( https://marketplace.visualstudio.com/items?itemName=AmazonEMR.emr-tools ) to help with local development and will be adding additional tooling to make building/deploying jobs easier. :)

StevenTollanis commented 1 year ago

@dacort Yes, I could run the job. Of course, after that I even got errors like No module named 'boto3', so I had to include it in the tar.gz. Shouldn't that already be available to import?

Yes, my tar.gz file permissions are the same as yours:

tar tzvf pyspark_3.10.6.tar.gz | grep bin/python
-rwxr-xr-x  0 root   root 19801640 2023-01-03 10:28 bin/python3.10
-rwxr-xr-x  0 root   root 19801640 2023-01-03 10:28 bin/python
-rwxr-xr-x  0 root   root 19801640 2023-01-03 10:28 bin/python3

But my pyspark_3.10.6.tar.gz was packaged as my current machine user, not as the root user that Docker uses by default:

-rw-------  1  myuser myuser 119801640 2023-01-03 10:28 pyspark_3.10.6.tar.gz  [This worked]
-rw-------  1  root   root   119801640 2023-01-03 10:28 pyspark_3.10.6.tar.gz  [This didn't]

Thanks for the tip on the EMR VS Code extension

dacort commented 1 year ago

No module named 'boto3', so I had to include it in the tar.gz. Shouldn't that already be available to import?

If you're using a custom python version, you likely need to install boto3. I believe boto3 is available in the base image, but with a custom python version, I'm guessing it also uses the libraries installed only for that version.

FYI, you can now also use custom images with EMR Serverless, which might make installing dependencies a bit easier. :) See the "build a data science image" example for details and the docs on how to use custom images.

Note that the image URL in the first link is slightly incorrect - it should be spark, not spark3 - we're updating the docs on that shortly.
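
A minimal sketch of that custom image approach, assuming the corrected base image name and using placeholder ECR coordinates:

cat > Dockerfile <<'EOF'
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest
USER root
RUN pip3 install boto3 pandas
# drop back to the non-root hadoop user used by the base image
USER hadoop:hadoop
EOF
docker build -t <account-id>.dkr.ecr.<region>.amazonaws.com/emr-serverless-custom:latest .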