apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Redshift connection breaking change with IAM authentication #31551

Closed sunank200 closed 1 year ago

sunank200 commented 1 year ago

Apache Airflow version

2.6.1

What happened

This PR introduced the get_iam_token method in redshift_sql.py. This is a breaking change: it introduces a check for iam in the connection extras, and it defaults to False.

Error log:


self = <airflow.providers.amazon.aws.hooks.redshift_sql.RedshiftSQLHook object at 0x7f29f7c208e0>
conn = redshift_default

    def get_iam_token(self, conn: Connection) -> tuple[str, str, int]:
        """
        Uses AWSHook to retrieve a temporary password to connect to Redshift.
        Port is required. If none is provided, default is used for each service
        """
        port = conn.port or 5439
        # Pull the cluster identifier from the beginning of the Redshift URL
        # ex. my-cluster.ccdre4hpd39h.us-east-1.redshift.amazonaws.com returns my-cluster
>       cluster_identifier = conn.extra_dejson.get("cluster_identifier", conn.host.split(".")[0])
E       AttributeError: 'NoneType' object has no attribute 'split'

.nox/test-3-8-airflow-2-6-0/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/redshift_sql.py:107: AttributeError
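For readers unfamiliar with the failure mode in this traceback: dict.get evaluates its default argument eagerly, before checking whether the key exists, so conn.host.split(".")[0] runs even when cluster_identifier is present in the extras. A minimal, Airflow-free repro (variable names are illustrative):

```python
# Minimal repro of the eager-default pitfall behind this traceback:
# dict.get evaluates its default argument BEFORE performing the lookup.
extra_dejson = {"cluster_identifier": "my-cluster"}  # the key IS present
host = None  # host left unset on the connection, as in this issue

try:
    extra_dejson.get("cluster_identifier", host.split(".")[0])
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'split'
```

So the exception fires whenever host is empty, regardless of whether the cluster_identifier extra is set.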

What you think should happen instead

It should maintain backward compatibility.

How to reproduce

Run an example DAG for Redshift with an AWS IAM profile supplied at hook initialization to retrieve a temporary password to connect to Amazon Redshift.

Operating System

macOS

Versions of Apache Airflow Providers

No response

Deployment

Astronomer

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

Code of Conduct

boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

phanikumv commented 1 year ago

Upgrading to the latest Amazon provider broke my existing DAG that uses a Redshift connection. I would consider it a bug that needs quick resolution.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/functools.py:967: in __get__
    val = self.func(instance)
src/astro/databases/aws/redshift.py:107: in sqlalchemy_engine
    uri = self.hook.get_uri()
.nox/test-3-8-airflow-2-6-0/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/redshift_sql.py:122: in get_uri
    conn_params = self._get_conn_params()
.nox/test-3-8-airflow-2-6-0/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/redshift_sql.py:84: in _get_conn_params
    conn.login, conn.password, conn.port = self.get_iam_token(conn)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <airflow.providers.amazon.aws.hooks.redshift_sql.RedshiftSQLHook object at 0x7f29f7c208e0>
conn = redshift_default

    def get_iam_token(self, conn: Connection) -> tuple[str, str, int]:
        """
        Uses AWSHook to retrieve a temporary password to connect to Redshift.
        Port is required. If none is provided, default is used for each service
        """
        port = conn.port or 5439
        # Pull the cluster identifier from the beginning of the Redshift URL
        # ex. my-cluster.ccdre4hpd39h.us-east-1.redshift.amazonaws.com returns my-cluster
>       cluster_identifier = conn.extra_dejson.get("cluster_identifier", conn.host.split(".")[0])
E       AttributeError: 'NoneType' object has no attribute 'split'
dstandish commented 1 year ago

i am curious how the user encountered this problem. it looks like "iam": True was a newly added extra param, and that code doesn't get invoked unless you set it to True in the connection extras

hussein-awala commented 1 year ago

This PR introduced the get_iam_token method in redshift_sql.py. This is a breaking change: it introduces a check for iam in the connection extras, and it defaults to False.

I agree with @dstandish: this change is backward compatible, since the method is not invoked unless you provide the "iam" extra with a True value. This extra is new for this hook, but it is already used in the MySQL and Postgres hooks to connect to RDS (or to Redshift, in the Postgres hook's case). So if you have this extra set for one of those two hooks, you can duplicate the connection and remove the extra from the connection used by the Redshift hook.

phanikumv commented 1 year ago

The "iam" parameter is not being set to false by default - https://github.com/apache/airflow/blob/5ae9728db7d34d287907ca3a919ac1a94c776799/airflow/utils/db.py#L551

dstandish commented 1 year ago

that's interesting, and maybe not the best choice (particularly given that it looks like it would blow up, since host is not populated), but it's just an example connection. and it looks like it was probably released at the same time as this new feature.

don't get me wrong, i think we should make the code better, but strictly speaking it just doesn't seem to be a breaking change -- whatever that's worth.

eladkal commented 1 year ago

I'm removing the priority label. Issues that are localized to a specific operator in a specific provider are not considered urgent, as the blast radius of the issue is not big.

cc @o-nikolas @shubham22 @cjames23 WDYT about the issue?

sunank200 commented 1 year ago

@dstandish @hussein-awala @phanikumv This is a bug in the current implementation as per the PR.

When a user passes the following as part of the Redshift connection, it breaks:

{
  "iam": true,
  "cluster_identifier": "redshift-cluster-1",
  "port": 5439,
  "region": "us-east-1",
  "db_user": "awsuser",
  "database": "dev",
  "profile": "default"
}
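For context on the discussion above and below: the new code path is gated on the iam extra, roughly as follows (a paraphrase of the gating logic for illustration, not the provider's exact code), which is why connections without that extra are unaffected:

```python
# Hypothetical paraphrase of how the hook gates the IAM token path:
# the get_iam_token branch only runs when extras contain "iam": true.
extras = {
    "iam": True,
    "cluster_identifier": "redshift-cluster-1",
    "port": 5439,
    "region": "us-east-1",
    "db_user": "awsuser",
    "database": "dev",
    "profile": "default",
}

if extras.get("iam", False):
    print("IAM path taken: a temporary password would be fetched here")
else:
    print("Plain credentials path taken")
```

With the extras shown in this issue, the IAM branch is taken, and the host-parsing fallback inside it is what fails.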

The following screenshot showed that host is not set in the connection but cluster_identifier is set.

Even though cluster_identifier is set here, the following line breaks with an AttributeError when host is not set, because the default argument conn.host.split(".")[0] is evaluated first, and calling split on a NoneType object throws:

conn.extra_dejson.get("cluster_identifier", conn.host.split(".")[0])


The following is the test we ran:

>>> conn.extra_dejson.get("cluster_identifier")
'<REDSHIFT_CLUSTER_IDENTIFIER>'
>>> conn.extra_dejson.get("cluster_identifier", None)
'<REDSHIFT_CLUSTER_IDENTIFIER>'
>>> conn.host

>>> conn.extra_dejson.get("cluster_identifier", conn.host.split(".")[0])
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
    exec(exp, global_vars, local_vars)
  File "<input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'split'

Thanks @pankajkoti for your support with the connection details we use in astro-sdk and debugging.

Conclusion:

This is a bug in the current implementation for all users when host is not set in the connection (not only for default connections). It requires a null check on conn.host.
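A minimal sketch of the kind of null check this conclusion calls for (resolve_cluster_identifier is a hypothetical helper for illustration, not the actual provider code):

```python
from typing import Optional


def resolve_cluster_identifier(extras: dict, host: Optional[str]) -> str:
    """Prefer the explicit cluster_identifier extra; only fall back to
    parsing the host, and only when a host is actually set, instead of
    passing host.split(".")[0] as an eagerly evaluated default."""
    if "cluster_identifier" in extras:
        return extras["cluster_identifier"]
    if host:
        # e.g. my-cluster.ccdre4hpd39h.us-east-1.redshift.amazonaws.com -> my-cluster
        return host.split(".")[0]
    raise ValueError("Either the cluster_identifier extra or a host must be set")
```

The key design point is that the host fallback is only reached when the extra is absent, so a connection with cluster_identifier set and no host no longer raises.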

hussein-awala commented 1 year ago

Indeed, conn.host.split(".")[0] is evaluated regardless of the extra values. I will fix it.

phanikumv commented 1 year ago

@sunank200 is already working on fixing this!

sunank200 commented 1 year ago

@hussein-awala I am already working on this PR

hussein-awala commented 1 year ago

:ok_hand:

pankajkoti commented 1 year ago

yes, along with being a bug, I would consider it a breaking change, as users like us would not have set host but would have set the connection extras cluster_identifier: <cluster_identifier> together with iam: true, which is needed for such a configuration, and it breaks existing running DAGs.

hussein-awala commented 1 year ago

yes, along with being a bug, I would consider it a breaking change, as users like us would not have set host but would have set the connection extras cluster_identifier: together with iam: true, which is needed for such a configuration, and it breaks existing running DAGs.

I don't agree; IMO this is just a new broken feature. Setting iam to false or removing it is sufficient to deactivate the feature and avoid the bug.

pankajkoti commented 1 year ago

yes, along with being a bug, I would consider it a breaking change, as users like us would not have set host but would have set the connection extras cluster_identifier: together with iam: true, which is needed for such a configuration, and it breaks existing running DAGs.

I don't agree; IMO this is just a new broken feature. Setting iam to false or removing it is sufficient to deactivate the feature and avoid the bug.

If the new feature is broken and has no effect outside the scope of the feature, I believe that's fine. But it is affecting existing running DAGs too, because the previously set-up connection has iam: true. Here the existing configuration becomes invalid and produces unexpected results due to the change in the PR.

Curious to understand why having to alter the connection/configuration by setting iam: false is not a breaking change in itself.

Also, setting iam: false may disable this feature, but it has wider implications: I would like to continue using IAM authentication to connect to the cluster. https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/connections/redshift.html#examples

hussein-awala commented 1 year ago

@pankajkoti a breaking change breaks something that worked correctly before the change. In our case, if someone has an old Redshift connection, it will be without the iam extra, so the new change will not affect their DAGs. But if they init the DB with the conf database.LOAD_DEFAULT_CONNECTIONS, that command will create a new connection with the iam extra set to true, and the DAGs will be broken. Since that is a new connection, logically this issue should be considered a bug rather than a breaking change. WDYT?

pankajkoti commented 1 year ago

@hussein-awala okay, thanks. For us, it's an automated script that creates the connection. But let me test it sometime soon across provider versions to see whether an existing connection with iam: true and a non-default connection ID works or not.

The iam: true extra is required by the redshift_connector library, and I believe the field was being used before too; it is not a new one introduced as part of this feature.