Closed — sunank200 closed this issue 1 year ago
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.
Upgrading to the latest Amazon provider broke my existing DAG, which uses a Redshift connection. I would consider this a bug that needs quick resolution.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/functools.py:967: in __get__
val = self.func(instance)
src/astro/databases/aws/redshift.py:107: in sqlalchemy_engine
uri = self.hook.get_uri()
.nox/test-3-8-airflow-2-6-0/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/redshift_sql.py:122: in get_uri
conn_params = self._get_conn_params()
.nox/test-3-8-airflow-2-6-0/lib/python3.8/site-packages/airflow/providers/amazon/aws/hooks/redshift_sql.py:84: in _get_conn_params
conn.login, conn.password, conn.port = self.get_iam_token(conn)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <airflow.providers.amazon.aws.hooks.redshift_sql.RedshiftSQLHook object at 0x7f29f7c208e0>
conn = redshift_default
def get_iam_token(self, conn: Connection) -> tuple[str, str, int]:
"""
Uses AWSHook to retrieve a temporary password to connect to Redshift.
Port is required. If none is provided, default is used for each service
"""
port = conn.port or 5439
# Pull the cluster-identifier from the beginning of the Redshift URL
# ex. my-cluster.ccdre4hpd39h.us-east-1.redshift.amazonaws.com returns my-cluster
> cluster_identifier = conn.extra_dejson.get("cluster_identifier", conn.host.split(".")[0])
E AttributeError: 'NoneType' object has no attribute 'split'
I am curious how the user encountered this problem. It looks like "iam": True was a newly added extra param, and that code doesn't get invoked unless you have it set to True in the connection extras.
This PR introduced the get_iam_token method in redshift_sql.py. This is the breaking change, as it introduces a check for iam in extras, which is set to False by default.
I agree with @dstandish: this change is backward compatible, since the method is not invoked unless you provide the "iam" extra with a True value. This extra is new for this hook, but it is already used in the MySQL and Postgres hooks to connect to RDS (or to Redshift in the Postgres hook), so if you have this extra set for one of those two hooks, you can duplicate the connection and remove the extra from the connection used by the Redshift hook.
The "iam" parameter is not being set to false by default - https://github.com/apache/airflow/blob/5ae9728db7d34d287907ca3a919ac1a94c776799/airflow/utils/db.py#L551
That's interesting, and maybe not the best choice (particularly given that it looks like it would blow up, since host is not populated), but it's just an example connection, and it looks like it was probably released at the same time as this [new] feature.
Don't get me wrong, I think we should make the code better, but strictly speaking it just doesn't seem to be a breaking change, for whatever that's worth.
I'm removing the priority label. Issues that are localized to a specific operator in a specific provider are not considered urgent, as the radius of the issue is not big.
cc @o-nikolas @shubham22 @cjames23 WDYT about the issue?
@dstandish @hussein-awala @phanikumv This is a bug in the current implementation as per the PR.
When a user passes the following extras as part of the Redshift connection, it breaks:
{
    "iam": true,
    "cluster_identifier": "redshift-cluster-1",
    "port": 5439,
    "region": "us-east-1",
    "db_user": "awsuser",
    "database": "dev",
    "profile": "default"
}
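For reference, here is a minimal sketch of creating such a connection programmatically for testing; the connection ID redshift_default and the extra values are illustrative placeholders mirroring the example above, not the actual connection used in the report.

# Sketch: build a Redshift connection whose extras enable IAM auth but whose
# host is intentionally left unset, mirroring the setup described above.
import json
from airflow.models.connection import Connection

conn = Connection(
    conn_id="redshift_default",  # placeholder connection ID
    conn_type="redshift",
    extra=json.dumps(
        {
            "iam": True,
            "cluster_identifier": "redshift-cluster-1",
            "port": 5439,
            "region": "us-east-1",
            "db_user": "awsuser",
            "database": "dev",
            "profile": "default",
        }
    ),
)
# One way to expose it to Airflow without touching the metadata DB:
# os.environ["AIRFLOW_CONN_REDSHIFT_DEFAULT"] = conn.get_uri()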
As shown in the connection screenshot, host is not set in the connection but cluster_identifier is set. Even though cluster_identifier is set, this breaks with the following error: the default argument of extra_dejson.get is evaluated eagerly, so conn.host.split(".")[0] runs even when cluster_identifier is present, and since host is not set, splitting the NoneType object throws an AttributeError.
conn.extra_dejson.get("cluster_identifier", conn.host.split(".")[0])
The following is the test we ran:
>>> conn.extra_dejson.get("cluster_identifier")
'<REDSHIFT_CLUSTER_IDENTIFIER>'
>>> conn.extra_dejson.get("cluster_identifier", None)
'<REDSHIFT_CLUSTER_IDENTIFIER>'
>>> conn.host
>>> conn.extra_dejson.get("cluster_identifier", conn.host.split(".")[0])
Traceback (most recent call last):
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydevd_bundle/pydevd_exec2.py", line 3, in Exec
exec(exp, global_vars, local_vars)
File "<input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'split'
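The root cause is that Python evaluates the default argument of dict.get eagerly, before the lookup happens, so the fallback expression runs even when cluster_identifier is present. A tiny standalone illustration (not taken from the report):

# dict.get evaluates its default argument before doing the lookup, so the
# fallback expression below runs even though the key exists in the dict.
extras = {"cluster_identifier": "redshift-cluster-1"}
host = None  # conn.host is None when the connection has no host set

try:
    extras.get("cluster_identifier", host.split(".")[0])
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'split'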
Thanks @pankajkoti for your support with the connection details we use in astro-sdk and debugging.
This is a bug in the current implementation for all users when host is not set in the connection (not only for default connections). This would require a null check for conn.host.
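A minimal sketch of what such a null check inside get_iam_token might look like; this is only an illustration of the idea, not the fix that was eventually merged.

# Sketch of a null-safe variant of the offending line: only fall back to
# parsing conn.host when no cluster_identifier extra is provided.
# (assumes: from airflow.exceptions import AirflowException)
cluster_identifier = conn.extra_dejson.get("cluster_identifier")
if not cluster_identifier:
    if conn.host is None:
        raise AirflowException(
            "Please set either the 'cluster_identifier' extra or the connection host."
        )
    # ex. my-cluster.ccdre4hpd39h.us-east-1.redshift.amazonaws.com -> my-cluster
    cluster_identifier = conn.host.split(".")[0]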
Indeed, conn.host.split(".")[0] is evaluated regardless of the extra values.
I will fix it.
@sunank200 is already working on fixing this!
@hussein-awala I am already working on this PR
:ok_hand:
Yes, along with being a bug, I would consider it a breaking change, as users like us would not have set host but would have set the connection extras cluster_identifier: <cluster_identifier> together with iam: true, which is needed for such a configuration, and it breaks existing running DAGs.
I don't agree; IMO this is just a new broken feature. Setting iam to false or removing it is sufficient to deactivate the feature and avoid the bug.
If the new feature were broken without any effect outside its own scope, I believe that would be fine. But it is affecting existing running DAGs too, because the previously set up connection has iam: true. Here the existing configuration becomes invalid and produces unexpected results due to the change in the PR.
I am curious to understand why having to alter the connection/configuration by setting iam: false is not a breaking change in itself.
Also, setting iam: false may disable this feature but has wider implications: I would like to continue using IAM authentication to connect to the cluster. https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/connections/redshift.html#examples
@pankajkoti a breaking change is breaking something that worked correctly before the change. In our case, if someone has an old Redshift connection, it will be without the iam extra, so the new change will not affect their DAGs. But if they initialize the db with the config database.LOAD_DEFAULT_CONNECTIONS, this command will create a new connection with the iam extra set to true and the DAGs will be broken; since this is a new cluster, logically this issue should be considered a bug rather than a breaking change. WDYT?
@hussein-awala okay, thanks. For us, it's an automated script that creates the connection. But let me test soon, across provider versions, whether an existing connection with iam: true and a non-default connection ID works or not.
The iam: true extra is required by the redshift_connector library; I believe the field was already being used before and is not a new one introduced as part of this feature.
Apache Airflow version
2.6.1
What happened
This PR introduced the get_iam_token method in redshift_sql.py. This is the breaking change, as it introduces a check for iam in extras, which is set to False by default. Error log: see the traceback above.
What you think should happen instead
It should maintain backward compatibility
How to reproduce
Run an example DAG for Redshift with the AWS IAM profile given at hook initialization to retrieve a temporary password to connect to Amazon Redshift.
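A minimal reproduction sketch, assuming the provider's RedshiftSQLOperator and a connection carrying the IAM extras shown earlier; the connection ID and query are placeholders.

# Sketch: minimal DAG that exercises RedshiftSQLHook through RedshiftSQLOperator.
# With iam set to true in the connection extras and no host configured, task
# execution fails inside get_iam_token() on affected provider versions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.redshift_sql import RedshiftSQLOperator

with DAG(
    dag_id="redshift_iam_repro",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
):
    RedshiftSQLOperator(
        task_id="select_one",
        redshift_conn_id="redshift_default",  # connection with the IAM extras
        sql="SELECT 1;",
    )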
Operating System
mac-os
Versions of Apache Airflow Providers
No response
Deployment
Astronomer
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
Code of Conduct