Hi @amin010 - how did you build your venv? We have a note in the docs regarding this:

> You must run the following commands in a similar Amazon Linux 2 environment with the same version of Python as you use in EMR Serverless, that is, Python 3.7.10 for EMR 6.6.0.

There's a sample Dockerfile in this repo that uses `amazonlinux:2` as the base container image. Let me know!
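For what it's worth, a minimal sketch of that flow, assuming you only need a couple of pip packages (the mount path, package list, and output name here are illustrative, not the exact steps from the repo's Dockerfile):

```bash
# Start an interactive Amazon Linux 2 container, mounting the current dir for output
docker run --rm -it -v "$PWD":/work amazonlinux:2 /bin/bash

# Inside the container: install Python 3.7, build the venv, and pack it
yum install -y python3
python3 -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install --upgrade pip
pip install venv-pack pandas        # plus whatever else your job needs
venv-pack -o /work/pyspark_venv.tar.gz
```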
I think you're right, I just used Python 3.7.10 and not Amazon Linux 2. The error message "No such file or directory" threw me off. Thanks.
Thanks for following up. I do think we need to make it more clear in the docs - I've had other folks run into similar issues.
Same error as @amin010 (not sure if EMR unpacks the tar.gz in the correct folder), then I packed it in a .zip, and now I have:
`Cannot run program "./environment/bin/python3.9": error=13, Permission denied`
I already followed the Docker steps using `amazonlinux:2`. Thanks
@StevenTollanis Based on your error message, it looks like you're using a different version of Python (3.9 vs 3.7). Can you provide the Dockerfile you used? EMR Serverless can use a different version of Python, but it requires some additional steps.
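For context, the "additional steps" amount to pointing Spark at the packaged interpreter when you submit the job. A rough sketch with the AWS CLI (application ID, role ARN, bucket, and script are placeholders; the conf names follow the EMR Serverless samples):

```bash
aws emr-serverless start-job-run \
  --application-id <application-id> \
  --execution-role-arn <job-role-arn> \
  --job-driver '{
    "sparkSubmit": {
      "entryPoint": "s3://my-bucket/job.py",
      "sparkSubmitParameters": "--conf spark.archives=s3://my-bucket/pyspark_venv.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
    }
  }'
```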
I also verified that the python3 version used in the `amazonlinux:2` image is still 3.7.

Enter bash in the image:

```bash
docker run --rm -it amazonlinux:2 /bin/bash
```

Install Python and check the version:

```bash
yum install -y python3 && python3 --version
```

Sample output:

```
bash-4.2# python3 --version
Python 3.7.15
```
@dacort This is my Dockerfile:

```dockerfile
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install gcc openssl-devel bzip2-devel libffi-devel gzip make -y
RUN yum install wget tar -y
WORKDIR /opt
RUN wget https://www.python.org/ftp/python/3.9.6/Python-3.9.6.tgz
RUN tar xzf Python-3.9.6.tgz
WORKDIR /opt/Python-3.9.6
RUN ./configure --enable-optimizations
RUN make altinstall
RUN rm -f /opt/Python-3.9.6.tgz

ENV VIRTUAL_ENV=/opt/venv
RUN python3.9 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

RUN python3.9 -m pip install --upgrade pip && \
    python3.9 -m pip install \
    boto3 pandas pyspark \
    venv-pack==0.2.0

RUN mkdir /output && venv-pack -o /output/emr-python.tar.gz

FROM scratch AS export
COPY --from=base /output/emr-python.tar.gz /
```
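As an aside, a multi-stage Dockerfile ending in a `FROM scratch AS export` stage like this is typically built with BuildKit so the packed venv is written back to the host. A sketch, assuming the artifact should land in `./output`:

```bash
# BuildKit exports the final (scratch) stage to the host directory
DOCKER_BUILDKIT=1 docker build --output ./output .
# The packed venv is now at ./output/emr-python.tar.gz, ready to upload to S3
```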
@dacort Ok, I'm going to check the "requires some additional steps" part and try again, first using tar.gz. Thanks
@StevenTollanis Cool, a couple of specific things to call out: note the `sparkSubmitParameters` when you run the job, specifically the `driverEnv` (and maybe `executorEnv`) config items.

@dacort I rebuilt my Dockerfile based on your custom Dockerfile, but it went back to the error I got the first time:
`Cannot run program "./environment/bin/python": error=13, Permission denied`
Oddly, the only packed format it could read was .zip. I think I'm going to try the EMR base Python version (3.7) to clear up any doubts about compatibility. Did these custom examples work for you?
> Note the sparkSubmitParameters when you run the job
Yes, I set those following the AWS docs at https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-python-libraries.html
@dacort Unbelievable, still `Cannot run program "./environment/bin/python": error=13, Permission denied`, but now using the default Python 3.7 version as you mentioned in Python Dependencies. BTW, I also tested your custom Dockerfile as-is, but got the same error.
Here's my log in case it helps:
```
Unpacking an archive s3://test-bucket/emr-python.zip#environment from /tmp/spark-big-number/emr-python.zip to /home/hadoop/./environment
Exception in thread "main" java.io.IOException: Cannot run program "./environment/bin/python": error=13, Permission denied
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:105)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1006)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1095)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1104)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=13, Permission denied
	at java.lang.UNIXProcess.forkAndExec(Native Method)
	at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
	at java.lang.ProcessImpl.start(ProcessImpl.java:134)
	at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
	... 14 more
22/12/27 16:38:55 INFO ShutdownHookManager: Shutdown hook called
22/12/27 16:38:55 INFO ShutdownHookManager: Deleting directory /tmp/spark-big-number
```
PS: @amin010 I wonder if you solved this issue using EMR Serverless, or if you switched to another tool such as Glue, or went the cluster route to execute these jobs. Thanks
Yes, as mentioned in the thread, my issue got resolved, that being said, I haven't used it since then so not aware of possible new issues.
> also tested your custom Dockerfile as is, but the same error.
I re-tested the instructions given there with EMR release versions 6.6.0 and 6.9.0 and they seemed to work fine. The permission denied error suggests the python executable might not be executable by users other than root. Going to try to test the Dockerfile you provided as well.
If you can provide step-by-step instructions that lead to the failure, that could help, but so far I can't tell where things might be going wrong.
@dacort ...hold on, I'm seeing that SageMaker can handle my dependencies, which is so useful. Any thoughts?
@StevenTollanis That's a tough question to answer in this medium. I suppose the answer will be influenced by what you are trying to accomplish at the end of the day. SageMaker is an end-to-end machine learning solution - so if you're building/training/deploying ML models to the cloud, it's a good approach. If you have pre-existing Spark jobs and want to run them on EMR, Serverless is a great way to start. If you need an end-to-end data integration platform, Glue is a good place to look.
Sorry you're having issues getting going with Serverless. For a simple dependency like pandas, though, the instructions in the dependencies section should work fine. I would start there again, replace `great_expectations` with `pandas` in the Dockerfile, and try to run a sample pandas job.
Thanks, I think I did start again with your clean Dockerfile, but it threw the same `error=13, Permission denied` error. Anyway, I'll try one more time. The test I did using SageMaker processing jobs went very well and smoothly (no dependency building). What do you think about using Step Functions with Lambdas to do the same ETL/Spark job? Any scalability issues? Happy New Year
First, I thought the issue was that I used a different image (I used linux/amd64 `public.ecr.aws/amazonlinux/amazonlinux:2` because of the rate limit for linux/amd64 `amazonlinux:2` on CodeBuild), then I built it locally using linux/amd64 `amazonlinux:2` and just added pandas to your custom Dockerfile... But again, `Cannot run program "./environment/bin/python": error=13, Permission denied`. I also checked permissions and ran `chmod a+rx` on the zip, but got the same error.
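Worth noting: `chmod` on the archive file itself doesn't change the permission bits recorded inside the archive, and those are what matter once EMR unpacks it. A quick way to check (archive names assumed):

```bash
# The python entries should show executable bits (e.g. -rwxr-xr-x)
zipinfo emr-python.zip | grep 'bin/python'
tar tzvf emr-python.tar.gz | grep 'bin/python'
```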
BTW, I just rechecked the AWS user guide for using Python libraries and found that it's missing the property `spark.emr-serverless.executorEnv.PYSPARK_PYTHON=./environment/bin/python`, which your samples have. I also tested this in my script, but still `Cannot run program "./environment/bin/python": error=13, Permission denied`.
@StevenTollanis I'm sorry, I'm not sure why it's not working for you. I've run the steps manually and it works fine for me. If you want to produce a detailed set of steps/artifacts including your Dockerfile and AWS CLI commands that cause this issue for you, I can help debug, but at the moment I'm not able to reproduce the issue.
I've created a simple script that uses the custom Dockerfile, adds pandas, and runs a successful job as an example: https://gist.github.com/dacort/9ad6f5712fca1c08618582f6674e4303
BTW, I created the EMR Serverless app and ran the jobs using the console rather than sending the commands via the CLI. That should work either way, right?
Should work either way, yes.
@dacort As you mentioned:

> The permission denied error seems like the python executable might not be executable by users other than root?

Your script from the gist helped me find out that Docker created the tar/zip as root. I had to make changes in my Linux VM so that Docker creates the artifact as the current machine user; I guess it's the same process for a build pipeline. Thank you very much for your help.
PS: I'm probably going to use another of the services mentioned above. EMR still doesn't have the "developer friendliness" that Lambda has for managing dependencies, or even an included IDE. Hopefully we'll get that soon.
@StevenTollanis Interesting. And you were able to successfully run the job after you made the changes? The python executable in the tar.gz (when I build the image locally on macOS) should still be user-executable, but I'll have to give it a shot on Linux as well.
```
$ tar tzvf pyspark_3.10.6.tar.gz | grep bin/python
-rwxr-xr-x  0 root root 19801640 Dec 27 10:28 bin/python3.10
-rwxr-xr-x  0 root root 19801640 Dec 27 10:28 bin/python
-rwxr-xr-x  0 root root 19801640 Dec 27 10:28 bin/python3
```
Thanks so much for continuing to dig in on this. We're definitely working on improving the developer experience. We've released a developer preview of a VS Code extension ( https://marketplace.visualstudio.com/items?itemName=AmazonEMR.emr-tools ) to help with local development and will be adding additional tooling to make building/deploying jobs easier. :)
@dacort Yes, I could run the job. Of course, after that I even got errors like `No module named 'boto3'`, so I had to include it in the tar.gz. Shouldn't that already be available to import?
Yes, my tar.gz file permissions are the same as yours:

```
$ tar tzvf pyspark_3.10.6.tar.gz | grep bin/python
-rwxr-xr-x  0 root root 19801640 2023-01-03 10:28 bin/python3.10
-rwxr-xr-x  0 root root 19801640 2023-01-03 10:28 bin/python
-rwxr-xr-x  0 root root 19801640 2023-01-03 10:28 bin/python3
```

But my pyspark_3.10.6.tar.gz was packaged by my current machine user, not the root user, which is what Docker does by default:

```
-rw------- 1 myuser myuser 119801640 2023-01-03 10:28 pyspark_3.10.6.tar.gz [This worked]
-rw------- 1 root   root   119801640 2023-01-03 10:28 pyspark_3.10.6.tar.gz [This didn't]
```
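Building on that finding, one possible workaround, if you don't want to reconfigure Docker itself (e.g. rootless mode), is to re-own the exported artifact after the build; paths and the file name here just follow the example above:

```bash
DOCKER_BUILDKIT=1 docker build --output ./output .
# BuildKit writes the artifact as root; hand it back to the current user
sudo chown "$(id -u):$(id -g)" ./output/pyspark_3.10.6.tar.gz
```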
Thanks for the tip on the EMR VS Code extension
> `No module named 'boto3'`, so I had to include it in the tar.gz. Shouldn't that already be available to import?
If you're using a custom python version, you likely need to install boto3. I believe boto3 is available in the base image, but with a custom python version, I'm guessing it also uses the libraries installed only for that version.
FYI, you can now also use custom images with EMR Serverless, which might make installing dependencies a bit easier. :) See the build a data science image post for details and the docs on how to use custom images. Note that the image URL in the first link is slightly incorrect: it should be `spark`, not `spark3`. We're updating the docs on that shortly.
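For reference, the custom image is attached when the application is created (or updated). A hedged sketch of the CLI call, with the ECR URI, region, and release label as placeholders:

```bash
aws emr-serverless create-application \
  --name my-custom-image-app \
  --type SPARK \
  --release-label emr-6.9.0 \
  --image-configuration '{"imageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-custom:latest"}'
```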
As detailed here and here, we should be able to install and use a Python venv, but that doesn't seem to work. The application fails with this error: