aws-samples / emr-serverless-samples

Example code for running Spark and Hive jobs on EMR Serverless.
https://aws.amazon.com/emr/serverless/
MIT No Attribution

EMR Serverless plugin in conflict with Airflow 2.2.2 constraints file #37

Closed MrThomasWagner closed 2 years ago

MrThomasWagner commented 2 years ago

Hi all,

I'm trying to use the latest release of the serverless plugin on MWAA with Airflow version 2.2.2: https://github.com/aws-samples/emr-serverless-samples/releases/tag/v1.0.1

The install is in conflict with the airflow v2.2.2 constraints file found here: https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.7.txt

Steps to reproduce

Requirements.txt contents:

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.2.2/constraints-3.7.txt"
emr_serverless @ https://github.com/aws-samples/emr-serverless-samples/releases/download/v1.0.1/mwaa_plugin.zip

Run:

pip3 install -r requirements.txt

Output:

The conflict is caused by:
    emr-serverless 1.0.1 depends on boto3>=1.23.9 and ~=1.23
    The user requested (constraint) boto3==1.18.65
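
To see why pip rejects this combination, the two boto3 clauses can be checked by hand against the pinned version. This is a minimal sketch, not pip's resolver: `parse`, `satisfies_clause`, and `satisfies` are hypothetical helpers written for illustration, and they only understand plain numeric versions with the `>=`, `==`, and `~=` operators.

```python
def parse(version: str) -> tuple:
    """Turn a plain numeric version such as '1.18.65' into (1, 18, 65)."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies_clause(version: str, clause: str) -> bool:
    """Check one clause: '>=X', '==X', or '~=X' (compatible release)."""
    clause = clause.strip()
    operator, target = clause[:2], parse(clause[2:])
    candidate = parse(version)
    if operator == ">=":
        return candidate >= target
    if operator == "==":
        return candidate == target
    if operator == "~=":
        # '~=1.23' means '>=1.23' while keeping everything before the
        # last component fixed, i.e. any 1.x release from 1.23 onwards.
        prefix = target[:-1]
        return candidate >= target and candidate[: len(prefix)] == prefix
    raise ValueError(f"unsupported clause: {clause!r}")


def satisfies(version: str, requirement: str) -> bool:
    """A comma-separated requirement holds only if every clause holds."""
    return all(satisfies_clause(version, c) for c in requirement.split(","))


# Requirement reported by pip for emr-serverless 1.0.1.
REQUIREMENT = ">=1.23.9,~=1.23"

print(satisfies("1.18.65", REQUIREMENT))  # False: the Airflow 2.2.2 pin
print(satisfies("1.26.10", REQUIREMENT))  # True: a newer 1.x release
```

Because a constraints file pins boto3 to exactly one version, there is no version left that satisfies both the pin and the plugin's requirement, which is what the error above reports.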

marknorkin commented 2 years ago

Facing the same issue. Is there a workaround for this?

dacort commented 2 years ago

I'm not quite sure why this started happening - will have to do some tests later this week.

Do you need the --constraint in there?

MrThomasWagner commented 2 years ago

Yes, it works fine without the constraints flag for my proof of concept. I have more dependencies I'll want to add in the future, though, and would like to be able to keep the constraints line.

Awesome plugin btw

dacort commented 2 years ago

Unfortunately EMR Serverless requires a newer version of boto3 than what's in that constraints file. I don't know if there's a way to override that...

MrThomasWagner commented 2 years ago

I noticed it doesn't conflict with the constraints file for Airflow 2.4.2, which is already out. MWAA is just a little behind on that. I.e.:

https://raw.githubusercontent.com/apache/airflow/constraints-2.4.2/constraints-3.7.txt

dacort commented 2 years ago

Yup, MWAA is still on 2.2.2. I'm curious, can you help me understand why you're including the constraints line? I know you kind of mentioned it, but I'm still not sure what it's used for / why it's needed?

MrThomasWagner commented 2 years ago

I was following this best practices guide in the MWAA docs: https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-dependencies.html

There is an Option 2 there using wheel fwiw.. maybe I'll look into that if 2.2.2 is SOL

dacort commented 2 years ago

Ahhh got it thank you. Yea, the boto3 will be an issue just because of when EMR Serverless support was added to it.

marknorkin commented 2 years ago

@dacort thank you for the response. I'm curious: which features of boto3>=1.23.9 and ~=1.23 do the EMR Serverless operators and sensors use that are not present in boto3==1.18.65? For example, we are using MWAA 2.2.2 and EMR Serverless 6.7.0 on our project, and cannot use the library because of this boto3 issue.

dacort commented 2 years ago

@marknorkin EMR Serverless was made generally available this year, and boto3 1.23.9 is when support for EMR Serverless was added. You can still use the Operator on MWAA 2.2.2, you just need to upgrade boto3 (which will happen automatically if you use the Operator from this repo).
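
One way to confirm whether the boto3 in a given environment knows about EMR Serverless is to ask a session for its list of services. This is a minimal sketch, assuming the service is registered under the name `emr-serverless` (the name passed to `boto3.client`); `supports_emr_serverless` is a hypothetical helper, not part of boto3 or this repo.

```python
def supports_emr_serverless(session) -> bool:
    """Return True if the session's SDK data includes EMR Serverless.

    `session` is expected to be a boto3 Session, but any object with a
    get_available_services() method returning service names will do.
    """
    return "emr-serverless" in session.get_available_services()


if __name__ == "__main__":
    try:
        import boto3
    except ImportError:
        print("boto3 is not installed in this environment")
    else:
        # Listing services reads the SDK's bundled data files, so this
        # check should not need AWS credentials.
        print("boto3 version:", boto3.__version__)
        print("emr-serverless available:",
              supports_emr_serverless(boto3.Session()))
```

On an environment that still has the boto3 pinned by the Airflow 2.2.2 constraints file, this would be expected to report that the service is unavailable, since that version predates EMR Serverless support.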

I wasn't aware of the recommendation in our docs to add the constraints line to the requirements.txt - that said, I've tried this operator with the upgraded boto3 with MWAA and haven't seen any issues.

dacort commented 2 years ago

Going to close this for now as EMR Serverless requires a newer version of boto3. If you're willing to forgo the constraints, you can still use the operator on MWAA, but I don't think there's a workaround that keeps them. The Operator is in use in MWAA environments.

For reference, this is the dependency tree of the EMR Serverless operator. You could potentially update the constraints file with the relevant versions... or I do see now that there is a constraints-no-providers file as well. Maybe that'll help if the concern is preventing upgrades of Airflow's core libraries?

https://raw.githubusercontent.com/apache/airflow/constraints-2.4.2/constraints-no-providers-3.7.txt

emr-serverless==1.0.1
  - boto3 [required: ~=1.23,>=1.23.9, installed: 1.26.10]
    - botocore [required: >=1.29.10,<1.30.0, installed: 1.29.10]
      - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
      - python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
        - six [required: >=1.5, installed: 1.16.0]
      - urllib3 [required: >=1.25.4,<1.27, installed: 1.26.12]
    - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
    - s3transfer [required: >=0.6.0,<0.7.0, installed: 0.6.0]
      - botocore [required: >=1.12.36,<2.0a.0, installed: 1.29.10]
        - jmespath [required: >=0.7.1,<2.0.0, installed: 1.0.1]
        - python-dateutil [required: >=2.1,<3.0.0, installed: 2.8.2]
          - six [required: >=1.5, installed: 1.16.0]
        - urllib3 [required: >=1.25.4,<1.27, installed: 1.26.12]
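
The idea of updating the constraints file with the relevant versions could be sketched as follows. This is an illustration only: `override_pins` is a hypothetical helper, the sample pins are the versions that appear in the pip output elsewhere in this thread, and the replacement versions are the "installed" ones from the tree above. Whether pip accepts the edited file still depends on the other pins in it. For example, the install log later in this thread shows apache-airflow-providers-amazon 2.4.0 requiring boto3<1.19.0.

```python
def override_pins(constraints_text: str, overrides: dict) -> str:
    """Return constraints text with selected 'name==version' pins replaced.

    Comment lines, blank lines, and pins for other packages are kept as-is.
    Keys in `overrides` are expected to be lowercase package names.
    """
    updated = []
    for line in constraints_text.splitlines():
        name, separator, _old_version = line.partition("==")
        key = name.strip().lower()
        is_pin = bool(separator) and not line.lstrip().startswith("#")
        if is_pin and key in overrides:
            updated.append(f"{name.strip()}=={overrides[key]}")
        else:
            updated.append(line)
    return "\n".join(updated) + "\n"


# A few pins as they appear in this thread for the Airflow 2.2.2 environment.
SAMPLE = """\
# excerpt of a constraints file
boto3==1.18.65
botocore==1.21.65
s3transfer==0.5.0
six==1.16.0
"""

# Versions taken from the dependency tree above.
OVERRIDES = {"boto3": "1.26.10", "botocore": "1.29.10", "s3transfer": "0.6.0"}

print(override_pins(SAMPLE, OVERRIDES))
```

The edited file would then need to be hosted somewhere the environment can read it and referenced from the `--constraint` line in place of the upstream URL.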

dlecina commented 2 years ago

Hello, even without constraints files, we are having this issue on a new MWAA 2.2.2 environment. Our only peculiarity is that we are hosting your released .zip file in our nexus repository (the file is unmodified):

adding trusted host: 'nexus.REDACTED' (from line 1 of /usr/local/airflow/requirements/requirements.txt)
adding trusted host: 'nexusmaster.REDACTED' (from line 2 of /usr/local/airflow/requirements/requirements.txt)
Looking in indexes: https://nexus.REDACTED/repository/pypi-public/simple/
Collecting emr_serverless@ https://nexusmaster.REDACTED/repository/REDACTED/REDACTED/mwaa_plugin.zip
  Downloading https://nexusmaster.REDACTED/repository/REDACTED/REDACTED/mwaa_plugin.zip (6.7 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting boto3>=1.23.9,~=1.23
  Downloading https://nexus.REDACTED/repository/pypi-public/packages/boto3/1.26.15/boto3-1.26.15-py3-none-any.whl (132 kB)
Collecting s3transfer<0.7.0,>=0.6.0
  Downloading https://nexus.REDACTED/repository/pypi-public/packages/s3transfer/0.6.0/s3transfer-0.6.0-py3-none-any.whl (79 kB)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in ./.local/lib/python3.7/site-packages (from boto3>=1.23.9,~=1.23->emr_serverless@ https://nexusmaster.REDACTED/repository/REDACTED/REDACTED/mwaa_plugin.zip->-r /usr/local/airflow/requirements/requirements.txt (line 5)) (0.10.0)
Collecting botocore<1.30.0,>=1.29.15
  Downloading https://nexus.REDACTED/repository/pypi-public/packages/botocore/1.29.15/botocore-1.29.15-py3-none-any.whl (9.9 MB)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in ./.local/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.15->boto3>=1.23.9,~=1.23->emr_serverless@ https://nexusmaster.REDACTED/repository/REDACTED/REDACTED/mwaa_plugin.zip->-r /usr/local/airflow/requirements/requirements.txt (line 5)) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in ./.local/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.15->boto3>=1.23.9,~=1.23->emr_serverless@ https://nexusmaster.REDACTED/repository/REDACTED/REDACTED/mwaa_plugin.zip->-r /usr/local/airflow/requirements/requirements.txt (line 5)) (1.26.7)
Requirement already satisfied: six>=1.5 in ./.local/lib/python3.7/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.30.0,>=1.29.15->boto3>=1.23.9,~=1.23->emr_serverless@ https://nexusmaster.REDACTED/repository/REDACTED/REDACTED/mwaa_plugin.zip->-r /usr/local/airflow/requirements/requirements.txt (line 5)) (1.16.0)
Building wheels for collected packages: emr-serverless
  Building wheel for emr-serverless (setup.py): started
  Building wheel for emr-serverless (setup.py): finished with status 'done'
  Created wheel for emr-serverless: filename=emr_serverless-1.0.1-py3-none-any.whl size=7414 sha256=da8ce9ab8a2ff91d9a3b883ddaafbc3c9e892133a4ffb499e420236b70068f0f
  Stored in directory: /tmp/pip-ephem-wheel-cache-lpa7pkzp/wheels/13/92/50/475b17c65c8d67d0c9ecba04a3df4e16188d880c57c8d90d8f
Successfully built emr-serverless
Installing collected packages: botocore, s3transfer, boto3, emr-serverless
  Attempting uninstall: botocore
    Found existing installation: botocore 1.21.65
    Uninstalling botocore-1.21.65:
      Successfully uninstalled botocore-1.21.65
  Attempting uninstall: s3transfer
    Found existing installation: s3transfer 0.5.0
    Uninstalling s3transfer-0.5.0:
      Successfully uninstalled s3transfer-0.5.0
  Attempting uninstall: boto3
    Found existing installation: boto3 1.18.65
    Uninstalling boto3-1.18.65:
      Successfully uninstalled boto3-1.18.65
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
apache-airflow-providers-amazon 2.4.0 requires boto3<1.19.0,>=1.15.0, but you have boto3 1.26.15 which is incompatible.
apache-airflow-providers-amazon 2.4.0 requires watchtower~=1.0.6, but you have watchtower 2.0.1 which is incompatible.
Successfully installed boto3-1.26.15 botocore-1.29.15 emr-serverless-1.0.1 s3transfer-0.6.0

Our requirements file is as follows:

--trusted-host nexus.REDACTED
--trusted-host nexusmaster.REDACTED
--index https://nexus.REDACTED/repository/pypi-public/
--index-url https://nexus.REDACTED/repository/pypi-public/simple/
emr_serverless @ https://nexusmaster.REDACTED/repository/REDACTED/REDACTED/mwaa_plugin.zip

As a bit of an aside, we have tried getting around this by setting this:

apache-airflow==2.2.2
apache-airflow-providers-amazon>=v5.1.0

This solves the version issue and install works correctly everywhere except on WebServer (as in https://repost.aws/questions/QUmgPhWhgmTFGMc18d7De40A/airflow-webserver-not-installing-python-requirements). However, if we set this and then try to use the operator in a DAG, the DAG gets processed correctly, but we never get a Task to actually run. We have also tried this with different versions of apache-airflow-providers-amazon (3.1.1, 5.1.0, 6.0.0). In the latter case we removed mwaa_plugin.zip as the library itself should already be providing the operator. We are unsure of the reason why this is not working (it may be our fault), hence why we are not opening a new issue yet.

In any case, we just wanted to let you know that just setting the emr_serverless requirement is not working for us, even without constraints.

dacort commented 2 years ago

@dlecina Interesting, thank you for all the detail. I know the MWAA team has been doing some work on Python requirements lately so I wonder if something changed here.

I will try to reproduce this and reopen this if I run into the same. Between the US holiday this week and re:Invent next week it may take me a bit, but I'll try to take a look ASAP.

dlecina commented 2 years ago

Thanks @dacort! Yes, I expect there have been some changes in the background that explain the different behavior.

In case it's helpful to anyone, in the end the following combination seemed to work for us; we were able to reach EMR Serverless with this:

--trusted-host nexus.REDACTED
--trusted-host nexusmaster.REDACTED
--index https://nexus.REDACTED/repository/pypi-public/
--index-url https://nexus.REDACTED/repository/pypi-public/simple/
apache-airflow==2.2.2
apache-airflow-providers-amazon==6.0.0
boto3>=1.23.9

Context: Setting apache-airflow-providers-amazon==6.1.0 would be ideal, as it has the correct boto3 requirement, but it demands apache-airflow>=2.3.0, which does not work with MWAA 2.2.2. Instead, we set boto3 explicitly, and that seemed to work as it does not conflict with either library. Not setting boto3 explicitly does not work in this configuration: although 6.0.0 includes the EMR Serverless Operator, its boto3 requirement allows an older version that does not have the emr-serverless API, so the task fails when it runs.

In short:

apache-airflow-providers-amazon==6.1.0 -> apache-airflow>=2.3.0 ❌ boto3>=1.24.0 ✔️
apache-airflow-providers-amazon==6.0.0 -> apache-airflow>=2.2.0 ✔️ boto3>=1.15.0 ❌
apache-airflow-providers-amazon==6.0.0 + boto3>=1.23.9 -> apache-airflow>=2.2.0 ✔️ boto3>=1.23.9 ✔️

dacort commented 2 years ago

Just to confirm, I was still able to use MWAA 2.2.2 with the release from this repository without a problem.

My requirements file is just this plugin, though.

emr_serverless @ https://github.com/aws-samples/emr-serverless-samples/releases/download/v1.0.1/mwaa_plugin.zip

I'm using the CDK stack from this repository.

I'll also try with the constraints-no-providers file and see if that works.