GoogleCloudPlatform / professional-services-data-validator

Utility to compare data between homogeneous or heterogeneous environments to ensure source and target tables match
Apache License 2.0

Impersonation credentials issue - Airflow #768

Open chandantroughia opened 1 year ago

chandantroughia commented 1 year ago

Hi Team,

I am trying to use this tool to validate some tables in BigQuery. I am using the code example from https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/samples/airflow/dvt_airflow_dag.py

I am getting the below mentioned error:

google.api_core.exceptions.Forbidden: 403 GET .... Access Denied: Table project_id:dataset_id.table_name : Permission bigquery.tables.get denied on table project_id:dataset_id.table_name (or it may not exist.)

I am using the correct impersonated credentials, e.g.:

import google.auth
from google.auth import impersonated_credentials

project_id = projectId  # project ID, defined elsewhere
target_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

default_creds, _ = google.auth.default()
impersonated_creds = impersonated_credentials.Credentials(
    source_credentials=default_creds,
    target_principal=SERVICE_ACCOUNT,
    target_scopes=target_scopes,
)

# bqhandler imported as in the sample DAG
handler = bqhandler.BigQueryResultHandler.get_handler_for_project(
    project_id=project_id, credentials=impersonated_creds
)

Please help me figure out what I am doing wrong.

nehanene15 commented 1 year ago

This looks good to me. I wonder if the service account that the Composer environment uses needs BigQuery permissions.

@helensilva14 May have more insight

bharath-cv commented 1 year ago

Hi Team,

To add to the above: the issue we are facing is with connecting to the source itself, not just the result handler. We are passing the service account credentials, Composer uses the same account, and it has access to the BQ tables, so we are not able to figure out the problem.

Also, in the source connection string we pass the below, per the DAG sample:

    BQ_CONN = {"source-type": "BigQuery", "project-id": project}

And we see an option to pass a key path in the connection string, as below:

# (Optional) BigQuery JSON Config File for On-Prem usecases
"google-service-account-key-path": "/path/to/key.json"

But our project doesn't have a key path. Is there any other way to pass the credentials in the connection string other than a JSON key path or Secret Manager?

helensilva14 commented 1 year ago

Hi @chandantroughia! Are BQ and Composer in the same GCP project? Could you please confirm that the SERVICE_ACCOUNT you are providing has at least one of the BQ roles that contain the missing bigquery.tables.get permission described in the error?

Hi @bharath-cv! What errors are you facing besides the one listed in the first comment? That would help us better understand the issues you're running into.

Also I'm going to re-create the sample DAG in our environment to see if I can reproduce the error.

bharath-cv commented 1 year ago

Hi @helensilva14

I am sure that the service account has these roles: when we run the DAG to run the batch process and query the tables, it is able to fetch the results. It is only with DVT that we are facing this error.

I am also getting same error as Chandan.

The only difference is our project doesn't have a keypath file defined.

dmedora commented 1 year ago

Could you look in Cloud Logging to find the corresponding permission denied entry? That will make it clear which principal is missing the bigquery.tables.get permission, and also let us see if the impersonation is happening correctly. It would be great if you could share the full log text here (after masking any sensitive information such as project IDs etc).
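
For example, a Logs Explorer filter along these lines should surface the entry (the field names are a best guess for the BigQuery audit logs, not an exact recipe):

resource.type="bigquery_resource"
severity=ERROR
protoPayload.status.message:"bigquery.tables.get"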

helensilva14 commented 1 year ago


Hi @bharath-cv, on top of @dmedora's comment: can you provide more details, such as the versions of Airflow, Composer, DVT, and Python you're using? Also, could you clarify whether the BQ table and Composer are in the same GCP project?

helensilva14 commented 1 year ago

Hi @chandantroughia and @bharath-cv! Have you been able to take a look at the Cloud Logging records, or do you have any updates on this case so we can continue to investigate it? If not, could you please try to do so within a week? Otherwise we'll consider closing the issue as "Can't reproduce/Stale". You can always reopen the issue if needed. Thank you!

bharath-cv commented 1 year ago

Hi @helensilva14

Please find the logs attached. The first is the log from the DAG. The second one, named log2, is the info from the link below:

google.api_core.exceptions.Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/****-***-***/datasets/poa_de_hcb_dev/tables/cob_param?prettyPrint=false: Access Denied: Table **-**-:poa_de_hcb_dev.cob_param: Permission bigquery.tables.get denied on table **-**-:poa_de_hcb_dev.cob_param (or it may not exist).

DVT_airflow_log1.txt

DVT_airflow_log2.txt

Please take a look and let us know how we can use the DVT tool. Our use case here is to compare data between on-prem (Hive) and GCP.

dmedora commented 1 year ago

Hi @bharath-cv,

Today, when you pass impersonated credentials to the BigQueryResultHandler, they are only used for writing the validation results to BigQuery. If your source or target system is BigQuery, it is currently not possible to pass impersonated credentials to the clients used to execute those queries; we have opened feature request #794 for that. So today, even if you pass impersonated creds to the result handler, the source and target queries use default creds.
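
To make that split concrete, here is a sketch reusing the names from the earlier snippets (as things stand before #794 lands):

# The impersonated credentials passed to the result handler are used only
# for writing validation results to BigQuery:
handler = bqhandler.BigQueryResultHandler.get_handler_for_project(
    project_id=project_id, credentials=impersonated_creds
)

# The source/target connection config does not yet accept a credentials
# object, so these queries fall back to Application Default Credentials:
BQ_CONN = {"source_type": "BigQuery", "project_id": project_id}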

For now, the options are to either grant the appropriate permissions to the service account that's bound to Composer, or provide a service account key file to be used by the job.

In your earlier comment https://github.com/GoogleCloudPlatform/professional-services-data-validator/issues/768#issuecomment-1466824114, you mentioned that the service account you're using for Composer already has BQ permissions, so we can try to debug that issue further.

Firstly, could you try running this DAG and let us know if it runs successfully? It runs a query using the BQ operator/client directly (instead of via DVT) so we can narrow down the cause of the issue. Note the TODOs indicating values you need to update.

from datetime import datetime
from airflow import models
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {"retries": 0, "owner": "dvt_debug"}

with models.DAG(
    "example_bigquery_insert",
    schedule_interval="@once",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    tags=["dvt_debug"],
) as dag:
    insert_job = BigQueryInsertJobOperator(
        task_id="insert_query_job",
        project_id="your-project", # TODO: edit with your project ID
        configuration={
            "query": {
                "query": "SELECT * FROM `my-project-id.my-dataset.my-table`",  # TODO: edit with you query
                "useLegacySql": False,
            }
        },
    )

    insert_job

Once you've run it, could you also please search in Cloud Logging as requested in my last comment and share the full entry details? You can use this filter:

resource.type="bigquery_resource"
protoPayload.methodName="jobservice.insert"
protoPayload.authenticationInfo.principalEmail="YOUR-COMPOSER-SA@YOUR-PROJECT.iam.gserviceaccount.com"

nehanene15 commented 1 year ago

@bharath-cv Is this still an issue?

mylar-pr commented 1 year ago

Hi @nehanene15 @helensilva14, I am facing the exact same issue.

I was able to run the insert query, but I had to pass the parameter impersonation_chain="YOUR-COMPOSER-SA@YOUR-PROJECT.iam.gserviceaccount.com" to the BigQueryInsertJobOperator function for a successful run.
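
For reference, a minimal sketch of that change applied to the debug DAG above (impersonation_chain is the Google provider parameter for running the job as another principal):

insert_job = BigQueryInsertJobOperator(
    task_id="insert_query_job",
    project_id="your-project",  # TODO: edit with your project ID
    configuration={
        "query": {
            "query": "SELECT * FROM `my-project-id.my-dataset.my-table`",
            "useLegacySql": False,
        }
    },
    # Run the job as the impersonated service account:
    impersonation_chain="YOUR-COMPOSER-SA@YOUR-PROJECT.iam.gserviceaccount.com",
)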

I appreciate that a feature request was added, but is there any workaround for this issue for now?

helensilva14 commented 1 year ago

Hi @mylar-pr, thanks for reaching out! Unfortunately, we do not have a workaround for this issue at the moment. Is it possible for you to execute DVT using other services, or do you only have Composer as an option?

mylar-pr commented 1 year ago

Hi Helen,

Thank you for the response. The purpose of DVT in my case is to perform data validation between BQ tables and Dataproc Hive tables. Could you please suggest any alternative services/solutions to accomplish this task?

Thank you,

Regards, Prajwal Mylar


nehanene15 commented 1 year ago

To use Composer, the options are to either grant the appropriate permissions to the service account that's bound to Composer, or provide a service account key file to be used by the job.

Alternatively, you can use Compute Engine, Cloud Run, or Cloud Functions, depending on the size of the validations you need to perform.
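
For the key-file route, a rough sketch of the connection dict (the underscore key name is assumed to mirror the hyphenated "google-service-account-key-path" config-file option quoted earlier in the thread):

BQ_CONN = {
    "source_type": "BigQuery",
    "project_id": project,
    # Assumed key name; see the config-file example earlier in the thread:
    "google_service_account_key_path": "/path/to/key.json",
}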

moizarafat commented 5 months ago

Hi, are there plans to support impersonation for running DVT inside an Airflow DAG? We are also in the same situation and want to use this library; impersonation would be really helpful. Thanks!

nehanene15 commented 5 months ago

@moizarafat Can you try the following in Airflow:

from google.auth import impersonated_credentials

# default_creds, SERVICE_ACCOUNT, and target_scopes as in the first comment
impersonated_creds = impersonated_credentials.Credentials(
    source_credentials=default_creds,
    target_principal=SERVICE_ACCOUNT,
    target_scopes=target_scopes,
)
# Pass the Credentials object directly in the connection config:
BQ_CONN = {"source_type": "BigQuery", "project_id": project, "credentials": impersonated_creds}

This should connect with the appropriate credentials. When you create a BQ connection via the CLI, we take the SA key path, but then use google.oauth2.service_account.Credentials.from_service_account_file() to extract the Credentials object.

You should be able to pass that in directly and it will populate the BQ Client here.
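
In other words, the CLI's key-path handling is roughly equivalent to building the Credentials object yourself and passing it in, e.g.:

from google.oauth2 import service_account

# Build the same Credentials object the CLI derives from a key path...
creds = service_account.Credentials.from_service_account_file("/path/to/key.json")
# ...and hand it straight to the connection config:
BQ_CONN = {"source_type": "BigQuery", "project_id": project, "credentials": creds}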

moizarafat commented 5 months ago

Thanks @nehanene15, that solved it. We had to pass the same credentials to the result handler too, for the BQ load of the results.
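
Putting the two together, the working setup sketched from the snippets above looks roughly like this:

# Same impersonated credentials for the source connection...
BQ_CONN = {"source_type": "BigQuery", "project_id": project, "credentials": impersonated_creds}

# ...and for the result handler that loads validation results into BQ:
handler = bqhandler.BigQueryResultHandler.get_handler_for_project(
    project_id=project, credentials=impersonated_creds
)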