aws-samples / emr-serverless-samples

Example code for running Spark and Hive jobs on EMR Serverless.
https://aws.amazon.com/emr/serverless/
MIT No Attribution

Custom python versions >= 3.10 fail on EMR Studio/Jupyter due to a badly patched version of livy #57

Closed hendrikmuhs closed 10 months ago

hendrikmuhs commented 10 months ago

This repository contains a great example of using a more recent python interpreter on EMR serverless.

Using that example I am able to use a custom python3.11 venv with preinstalled modules. This works fine for spark-submit jobs. In interactive mode, namely EMR Studio, I can also use my custom venv. However, the deployed version of jupyter, in particular ipython, has compatibility issues with newer versions of python. Some commands fail with:

An error was encountered:
required field "type_ignores" missing from Module
Traceback (most recent call last):
  File "/tmp/6833554925722006797", line 226, in execute
    code = compile(mod, '<stdin>', 'exec')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: required field "type_ignores" missing from Module
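
For reference, this error can be reproduced with any Python >= 3.8 interpreter by compiling an ast.Module that was built without its type_ignores field; a minimal illustration:

# Compiling a hand-built ast.Module without type_ignores raises the same TypeError on Python >= 3.8
python3 -c 'import ast; compile(ast.Module([ast.parse("x = 1").body[0]]), "<stdin>", "exec")'
# TypeError: required field "type_ignores" missing from Module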

I see 2 possible solutions to this:

But maybe I'm missing something? Is there another way? Am I misinterpreting the stack trace?

It is possible to develop locally, of course, but the data/computation needs to happen in AWS.

hendrikmuhs commented 10 months ago

I've spotted the bug: it is caused by a badly patched livy. This change does not exist in the original version.

AWS engineering has patched this file as follows:

                if sys.version < "3.8":
                    mod = ast.Module([node])
                else:
                    mod = ast.Module([node], [])

For Python 3.10 and 3.11, sys.version < "3.8" returns True, because the string comparison is lexicographic 🤦
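
To illustrate (python3.11 here is just an example; any Python >= 3.10 shows the same):

# sys.version is a string, so the comparison is lexicographic: "3.1..." sorts before "3.8"
python3.11 -c 'import sys; print(sys.version < "3.8")'
# True  -> the pre-3.8 branch is taken by mistake

# sys.version_info is a tuple, so the comparison is numeric
python3.11 -c 'import sys; print(sys.version_info < (3, 8))'
# False -> the correct branch would be taken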

This bad code can be found on line 222 of fake_shell.py, which is part of emr-serverless-livy, which in turn is part of Amazon's EMR Serverless image.

I could not find a corresponding repository; this probably sits in a private AWS fork. Unfortunately I don't know how to file a bug against that repo.

@dacort can you help?

dacort commented 10 months ago

Hi @hendrikmuhs - Thanks for opening this issue and digging in to find the root cause. Livy is definitely problematic for Python versions > 3.8. We've fixed that to some extent in EMR, but not for Python >= 3.10, as you've pointed out.

I'll have to look into where that stands with EMR Serverless and/or what options there are. Unfortunately, you can't run your own Jupyter server to connect to EMR-S at this time.

One quick question - when you mention emr-serverless-livy - where did you see that package?

hendrikmuhs commented 10 months ago

@dacort thanks for the quick feedback

Repro steps:

Pull the official AWS EMR image and run a shell:

docker run --user root -it --entrypoint /bin/bash public.ecr.aws/emr-serverless/spark/emr-7.0.0

In the shell:

bash-5.2# dnf list --installed emr-serverless-livy
Installed Packages
emr-serverless-livy.noarch                                                 0.7.1-1.amzn2023                                                  @System

The file in question is part of a jar:

bash-5.2# dnf repoquery --installed -l emr-serverless-livy

You get to the code when you unzip this file from the package:

unzip /usr/lib/livy/repl_2.12-jars/livy-repl_2.12-0.7.1-incubating.jar

dacort commented 10 months ago

@hendrikmuhs Awesome, thank you!

While I'm repro'ing on my side, you could try patching the fake_shell.py in that jar and see if it helps. I have an example of doing that, although it's for an older version of Livy (and EMR on EC2), so the patch file itself probably wouldn't work. But it gives a script to unpack/patch/pack the jar back up: https://gist.github.com/dacort/df1fba8b1e0cc7ef341d713e25ebf1a4

hendrikmuhs commented 10 months ago

Yeah, was thinking the same. Thanks for the gist, this makes it easier for me!

dacort commented 10 months ago

Confirmed that if you want to work around this, it's possible with a custom image. This is the Dockerfile I used for EMR 7:

FROM --platform=linux/amd64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest
USER root

RUN dnf install -y python3.11

WORKDIR /tmp
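# Patch Livy's fake_shell.py inside the repl jar: swap the string version check for a sys.version_info tuple comparison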
RUN jar xf /usr/lib/livy/repl_2.12-jars/livy-repl_2.12-0.7.1-incubating.jar fake_shell.py && \
    sed -ie 's/version < \"3\.8\"/version_info < \(3,8\)/' fake_shell.py && \
    jar uvf /usr/lib/livy/repl_2.12-jars/livy-repl_2.12-0.7.1-incubating.jar fake_shell.py
WORKDIR /home/hadoop

ENV PYSPARK_PYTHON=/usr/bin/python3.11

USER hadoop:hadoop
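
For reference, using such an image roughly means pushing it to ECR and attaching it to the application; a sketch with placeholder account ID, region, repository name, and application ID:

# Build the patched image and push it to ECR (all names/IDs below are placeholders)
docker build -t emr-7.0.0-custom .
aws ecr create-repository --repository-name emr-7.0.0-custom
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag emr-7.0.0-custom 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-7.0.0-custom:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-7.0.0-custom:latest

# Point the EMR Serverless application at the custom image
# (the application generally has to be stopped before it can be updated)
aws emr-serverless update-application \
    --application-id 00abc123def456 \
    --image-configuration '{"imageUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-7.0.0-custom:latest"}'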

I've also opened/bumped an internal ticket for this issue.

dacort commented 10 months ago

Updated the doc in https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/custom_python_version with examples for both EMR 6.x and 7.x and a note about Python >= 3.10. Closing this for now.