apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
36.41k stars 14.11k forks source link

Snowflake Provider - Hook's support for not providing a region is broken when using SQLAlchemy #20032

Closed mattpolzin closed 2 years ago

mattpolzin commented 2 years ago

Apache Airflow Provider(s)

snowflake

Versions of Apache Airflow Providers

Versions 2.2.x (since https://github.com/apache/airflow/commit/0a37be3e3cf9289f63f1506bc31db409c2b46738).

Apache Airflow version

2.2.1

Operating System

Debian GNU/Linux 10 (buster)

Deployment

Other 3rd-party Helm chart

Deployment details

Bitnami Airflow Helm chart @ version 8.0.2

What happened

When connecting to Snowflake via SQLAlchemy using the Snowflake Hook, I get an error that the URL is not valid because my Snowflake instance is in US West 2 (Oregon) which means I don't provide a region explicitly. Snowflake's documentation says:

If the account is located in the AWS US West (Oregon) region, no additional segments are required and the URL would be xy12345.snowflakecomputing.com

The error is that xy12345..snowflakecomputing.com is not a valid URL (note the double-dot caused by the lack of a region).

What you expected to happen

I expect the connection to be successful.

How to reproduce

You can use the default snowflake connection if you have one defined and see this problem with the following one-liner:

python -c 'from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook; SnowflakeHook().get_sqlalchemy_engine().connect()'

Anything else

Fortunately I imagine the fix for this is just to leave the region URL component out when region is None here: https://github.com/apache/airflow/commit/0a37be3e3cf9289f63f1506bc31db409c2b46738#diff-2b674ac999a5b938fe5045f6475b0c5cc76e4cab89174ac448a9e1d41a5c04d5R215.

Using version 2.1.1 of the Snowflake provider with version 2.2.1 of Airflow is currently a viable workaround so for now I am just avoiding the update to the provider.

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 2 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 2 years ago

Why don't you submit a PR with a fix @mattpolzin ? Sounds lke an eaasy fix.

mattpolzin commented 2 years ago

I appreciate the encouragement! I've got a lot of projects in progress right now so I don't think I'll find time to get set up here as well.

potiuk commented 2 years ago

I marked it as good-first-issue, so maybe someone else will pick it up then !

harishkrao commented 2 years ago

Is this available to pick up and implement?

potiuk commented 2 years ago

Sure @harishkrao . Assigned you.

potiuk commented 2 years ago

But maybe @mik-laj would like to do it firsst - so agree between yourselves ;)

harishkrao commented 2 years ago

Thank you for getting back @potiuk. @mik-laj how would you like to proceed?

mik-laj commented 2 years ago

Feel free to create a PR. I will take care of the review.

harishkrao commented 2 years ago

Great, I will do so. Thank you.

harishkrao commented 2 years ago

@mik-laj I have made the change, I have a question on the unit tests. They are failing because the connection sends the region name. Do you have a suggestion on how to test without a region name? https://github.com/harishkrao/airflow/blob/sqlalchemy-snowflake-region/tests/providers/snowflake/hooks/test_snowflake.py#L94

mik-laj commented 2 years ago

@harishkrao Hi. I improved the tests yesterday and I think now it will be easy to write the new test case to handle this sitution.. See: https://github.com/apache/airflow/pull/20095 Now all you have to do is add a test case here. https://github.com/apache/airflow/blob/545ca59ba9a0b346cbbf28cc6958f9575e5e6b0b/tests/providers/snowflake/hooks/test_snowflake.py#L76-L95 To build a URL, we shouldn't use string.format. Instead of it, we should use snowflake.sqlalchemy.URL. See: https://github.com/snowflakedb/snowflake-sqlalchemy#connection-parameters The main problem with concatenation is the problem with handling of special characters due to the lack of URL-encoding.

harishkrao commented 2 years ago

Thank you @mik-laj. I will write the test cases based on the new tests you pushed yesterday. Also thank you for the SQL Alchemy URL build link. I will use it to construct the URI.

harishkrao commented 2 years ago

@mik-laj Made the below changes, running pytests pending, will let you know after running them tomorrow: https://github.com/harishkrao/airflow/blob/fix-sqlalchemy-snowflake-region/airflow/providers/snowflake/hooks/snowflake.py#L216-L239 https://github.com/harishkrao/airflow/blob/fix-sqlalchemy-snowflake-region/tests/providers/snowflake/hooks/test_snowflake.py#L268-L279

harishkrao commented 2 years ago

@mik-laj How did you manage dependencies when running pytests? I am facing WARNING - Exception when importing 'airflow.providers.snowflake.hooks.snowflake.SnowflakeHook' from 'apache-airflow-providers-snowflake' package: No module named 'sqlalchemy.sql.roles' Both snowflake-sqlalchemy and sqlalchemy are installed.

mik-laj commented 2 years ago

Airflow Breeze

mik-laj commented 2 years ago

You can also use constraint files. See: https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html

potiuk commented 2 years ago

See also complete description here: https://github.com/apache/airflow/blob/main/TESTING.rst#airflow-unit-tests @harishkrao

harishkrao commented 2 years ago

Thank you @mik-laj and @potiuk, I will proceed.

harishkrao commented 2 years ago

I narrowed down the scope of our issue here. Once I am inside the virtual environment, and I try to import the same SQLAlchemy package, I run into the same error: `root@66ecbec1a1f0:/opt/airflow# python Python 3.8.12 (default, Nov 17 2021, 17:26:17) [GCC 8.3.0] on linux Type "help", "copyright", "credits" or "license" for more information.

from snowflake.sqlalchemy import URL Traceback (most recent call last): File "", line 1, in File "/usr/local/lib/python3.8/site-packages/snowflake/sqlalchemy/init.py", line 25, in from . import base, snowdialect File "/usr/local/lib/python3.8/site-packages/snowflake/sqlalchemy/base.py", line 17, in from .custom_commands import AWSBucket, AzureContainer, ExternalStage File "/usr/local/lib/python3.8/site-packages/snowflake/sqlalchemy/custom_commands.py", line 14, in from sqlalchemy.sql.roles import FromClauseRole ModuleNotFoundError: No module named 'sqlalchemy.sql.roles' ` It looks like there is a bug with SQLAlchemy 1.3.24, which is a dependency for Flask-AppBuilder. See 1621 and 953.

If you have some time spare, can one of you please try to import this package in your local venv and let me know if you face the same issue? @potiuk @mik-laj This package is not being used anywhere else in the entire airflow repository.

mik-laj commented 2 years ago

@harishkrao See: https://github.com/snowflakedb/snowflake-sqlalchemy/issues/234 To use URL class, you need to run:

python -m pip install snowflake-sqlalchemy==1.2.4 sqlalchemy==1.3.24

Unfortunately, Snowflake incorrectly marked the version requirements in one release of snowflake-sqlalchemy and after fixing the problem, they didn't mark the release as yanked, so pip downloads invalid dependency set.

mik-laj commented 2 years ago

I pushed a fix: https://github.com/apache/airflow/pull/20245

harishkrao commented 2 years ago

Thanks very much for letting me know and for pushing the fix @mik-laj :) I will use the above versions. Edit: It worked :+1: Thanks a lot!

harishkrao commented 2 years ago

@mik-laj any suggestions to overcome celery dependency issues? Searched the forum and online and found no open bugs related to it. Happens when running sudo ./breeze initialize-local-virtualenv --python 3.8 and the virtual environment has the celery version of 5.2.1.

from celery import Celery, Task, states as celery_states ModuleNotFoundError: No module named 'celery'

mik-laj commented 2 years ago
AIRFLOW_VERSION=main
PYTHON_VERSION="$(python --version | cut -d " " -f 2 | cut -d "." -f 1-2)"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
pip install celery --constraint "${CONSTRAINT_URL}"

See: https://airflow.apache.org/docs/apache-airflow/stable/installation/installing-from-pypi.html

harishkrao commented 2 years ago

My apologies for asking again, I am contributing to this repo for the first time. Will follow this and install the dependencies. Thank you for the help and patience @mik-laj.

harishkrao commented 2 years ago

Just an info - I am working on this, my laptop crashed, so I have to setup the environment in another laptop and continue, will get back soon.

harishkrao commented 2 years ago

Environment is configured, working on this, will get back in a day or two.

harishkrao commented 2 years ago

Made changes for the fix. Preparing the PR now.

@mik-laj a couple points for clarification. I added a unit test for the fix and it passed. I tried using the snowflake.sqlalchemy.URL, but during running pytest, it caused issues with generating the URL with nested dict values for session_parameters, so I reverted it to use string format, as is currently being used in the mainline branch.

mik-laj commented 2 years ago

@harishkrao Do you need help with preparing a PR?

harishkrao commented 2 years ago

@mik-laj Thank you for offering help, had to attend to something else until now. I will follow the steps in the guide and get back to you soon.

harishkrao commented 2 years ago

Just opened a PR, thank you for reviewing @mik-laj!