aws / amazon-redshift-python-driver

Redshift Python Connector. It supports Python Database API Specification v2.0.
Apache License 2.0
202 stars 72 forks source link

Cannot hit awsdatacatalog #189

Closed jayceslesar closed 5 months ago

jayceslesar commented 9 months ago

Driver version

2.0.913

Redshift version

PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.54052

Client Operating System

MacOs

Python version

3.10.6

Table schema

N/A

Problem description

  1. Expected behaviour: That I should be able to hit the awsdatacatalog with an IAM role (but cannot)
  2. Actual behaviour: I get an error shown below
  3. Error message/stack trace:
    redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': '42501', 'M': 'Error retrieving columns from the auto-mounted catalog: Current user is not authenticated with IAM credentials.', 'F': '../src/redcat/redcat_utils.cpp', 'L': '409', 'R': 'validateGlueAutoMountConfigParams'}
  4. Any other details that can be helpful: This doc seems to suggest that "To access the awsdatacatalog database from a cluster, you must use the authentication method Temporary credentials using your IAM identity. For more information about this authentication method, see Connecting to an Amazon Redshift database. Your query editor v2 administrator might need to configure the Account settings for the account to display this authentication method on the connection window."

Python Driver trace logs

Reproduction code

I created a role in redshift, granted it access to awsdatacatalog, and am running the following:

from redshift_connector import connect

conn = connect(
    iam=True,
    database="dev",
    user="",
    password="",
    cluster_identifier="dev-redshift-cluster",
    db_user="my_user",
)

cursor = conn.cursor()

cursor.execute("SHOW SCHEMAS FROM DATABASE awsdatacatalog;")

gives:

redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': '42501', 'M': 'Error retrieving columns from the auto-mounted catalog: Current user is not authenticated with IAM credentials.', 'F': '../src/redcat/redcat_utils.cpp', 'L': '409', 'R': 'validateGlueAutoMountConfigParams'}
Brooke-white commented 9 months ago

Hi @jayceslesar ,

Thank you for reaching out.

Current user is not authenticated with IAM credentials

I do not see the AWS temporary IAM credentials being passed to your call to connect(), directly nor via AWS profile. Passing user and password will result in native database authentication, despite setting IAM=True, which does not satisfy the IAM requirement. Please try using AWS profile or passing the IAM credentials directly and let me know if the issue persists. Tutorials can be found here for IAM connections.

jayceslesar commented 9 months ago

Thanks for the fast response, it seems that reinforces/confirms To access the awsdatacatalog database from a cluster, you must use the authentication method Temporary credentials using your IAM identity. For more information about this authentication method, see [Connecting to an Amazon Redshift database](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-using.html#query-editor-v2-connecting). Your query editor v2 administrator might need to configure the Account settings for the account to display this authentication method on the connection window. ? If that is the case, then redshift should probably raise a better error + documentation around doing this type of stuff should be cleaned up and is out of scope for this project.

Brooke-white commented 9 months ago

thanks for the feedback, Jayce. There should be a Provide feedback at the bottom of each AWS docs page where you can provide feedback for the AWS documentation. Closing this issue out, but please let me know if there's anything else I can help with.

graydenshand commented 9 months ago

@Brooke-white I see the same error when specifying temporary credentials as done in the tutorial you linked to.

import redshift_connector
conn = redshift_connector.connect(
    iam=True,
    database='dev',
    db_user='awsuser',
    password='',
    user='',
    cluster_identifier='<redacted>',
    profile='<redacted>',
    auto_create=True
 )

cursor = conn.cursor()

cursor.execute("SHOW SCHEMAS FROM DATABASE awsdatacatalog;")

Results in

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gshand/code/adhoc/redshift_connector/.venv/lib/python3.11/site-packages/redshift_connector/cursor.py", line 241, in execute
    self._c.execute(self, operation, args)
  File "/Users/gshand/code/adhoc/redshift_connector/.venv/lib/python3.11/site-packages/redshift_connector/core.py", line 1819, in execute
    self.handle_messages(cursor)
  File "/Users/gshand/code/adhoc/redshift_connector/.venv/lib/python3.11/site-packages/redshift_connector/core.py", line 2005, in handle_messages
    raise self.error
redshift_connector.error.ProgrammingError: {'S': 'ERROR', 'C': '42501', 'M': 'Error retrieving columns from the auto-mounted catalog: Current user is not authenticated with IAM credentials.', 'F': '../src/redcat/redcat_utils.cpp', 'L': '409', 'R': 'validateGlueAutoMountConfigParams'}

I can also verify that the automounted awsdatacatalog db on the target cluster is able to be accessed via temporary IAM credentials in the Redshift Query Editor V2.

Brooke-white commented 9 months ago

If you check https://docs.aws.amazon.com/redshift/latest/dg/r_STL_CONNECTION_LOG.html what is shown for the entry pertaining to your connection attempt in the authmethod column?

graydenshand commented 9 months ago

It is showing IAM AssumeUser, the same as the working credentials from query editor V2.

So, it seems like this must be something on AWS side. I will open a support ticket with them, and report back if there's anything that might be relevant to this package.

Thanks @Brooke-white, appreciate you looking into this.

Brooke-white commented 9 months ago

Thanks, Grayden. Closing this out but please re-open if needed.

graydenshand commented 9 months ago

Hi @Brooke-white,

I've finally got some useful feedback from AWS support.

For provisioned redshift clusters , you will now need to use get-cluster-credentials-with-iam API. This is the API which need to be used while accessing awsdatacatalog from JDBC.

[+] GetClusterCredentialsWithIAM - https://docs.aws.amazon.com/redshift/latest/APIReference/API_GetClusterCredentialsWithIAM.html

You will need to modify the code to call above API and use the credentials returned using above API call for connection, accordingly for provisioned and serverless redshift types.

It does look like this API call is supported by redshift_connector, https://github.com/aws/amazon-redshift-python-driver/blob/master/redshift_connector/iam_helper.py#L39-L63. However, it seems the way we're invoking the connect method, it's defaulting to the IAM_V1 method.

One other inconsistency I'm seeing is that with the IAM_V2 method (GetClusterCredentialsWithIAM), the database user is set automatically by redshift.

The database user is mapped 1:1 to the source AWS Identity and Access Management (IAM) identity.

So the db_user parameter should be None. However, if I try to leave db_user=None, I receive the following error:

redshift_connector.error.InterfaceError: Connection parameter db_user must be specified when using IAM authentication

Curious to get your thoughts on this!

graydenshand commented 9 months ago

One more update, I've confirmed that the problem I describe above is the root cause of this issue.

The following snippet calls the GetClusterCredentialsWithIAM api directly, and bypasses the redshift_connector IAM auth. It's able to complete the query that was failing originally.

from redshift_connector import connect
import boto3
import json

redshift = boto3.client("redshift")

resp = redshift.get_cluster_credentials_with_iam(
    DbName='dev',
    ClusterIdentifier='<redacted>,
    DurationSeconds=1000,
)

db_user = resp['DbUser']
db_password = resp['DbPassword']

conn = connect(
    database="dev",
    user=db_user,
    password=db_password,
    host="<redacted>"
)

cursor = conn.cursor()

cursor.execute("SHOW SCHEMAS FROM DATABASE awsdatacatalog;")
result = cursor.fetchall()
print(result)

I think as a fix, I would suggest making the following change:

I can open up a draft MR for this change if helpful.

Brooke-white commented 9 months ago

Hi @graydenshand , to use GetClusterCredentialsWithIAM please add the connection parameter group_federation=True. It is required for use of GetClusterCredentialsWithIAM. The IAM_V2 API does not allow the user to provide a user as the IAM_V1 API did. https://docs.aws.amazon.com/redshift/latest/APIReference/API_GetClusterCredentialsWithIAM.html

graydenshand commented 9 months ago

Ah, thanks @Brooke-white! Confirmed that adding group_federation=True fixes the issue.

I would never have guessed to try that based on the description of that parameter:

group_federation (bool): Use the IdP Groups in the Redshift. Default value False

What do you think of my suggestion of defaulting to GetClusterCredentialsWithIAM if iam=True and dbuser is None?

Brooke-white commented 9 months ago

Thanks for that feedback, @graydenshand . I will make a note to improve our docs to give more clear instructions on how users can enable use of the IAM_V2 API.

What do you think of my suggestion of defaulting to GetClusterCredentialsWithIAM if iam=True and dbuser is None?

I believe we default to the V1 API for backwards compatibility reasons, but I will verify with the team and report back.

Brooke-white commented 5 months ago

Apologies for the late response. I've confirmed with the team that this behavior is in place due to backwards compatibility reasons. As of now, we do not have plans to change the implementation.