Open chandantroughia opened 1 year ago
This looks good to me. I wonder if Composer has a specific service account that the environment uses which needs BigQuery permissions.
@helensilva14 may have more insight.
Hi Team,
On the above, the issue we are facing is with connecting to the source itself, not just the result handler. We are passing the service account credentials, Composer uses the same account, and it has access to the BQ tables, so we are unable to figure out the problem.
Also, in the source connection string we pass the below, per the DAG sample.
BQ_CONN = {"source-type": "BigQuery", "project-id": project}
And we see an option to pass a key path in the connection string, as below:
# (Optional) BigQuery JSON Config File for On-Prem usecases
"google-service-account-key-path": "/path/to/key.json"
But our project doesn't have a key path. Is there any other way to pass the credentials in the connection string other than a JSON key path or Secret Manager?
Hi @chandantroughia! Are BQ and Composer in the same GCP project? Could you please confirm whether the SERVICE_ACCOUNT you are providing has at least one of the BQ roles that contain the missing permission bigquery.tables.get described in the error?
Hi @bharath-cv! What errors are you facing besides the one listed in the first comment? That will help us better understand the issues you're running into.
Also I'm going to re-create the sample DAG in our environment to see if I can reproduce the error.
Hi @helensilva14
I am sure that the service account has these roles, because when we run the DAG to run the batch process and query the tables, it is able to fetch the results. It is only with DVT that we are facing this error.
I am also getting the same error as Chandan.
The only difference is that our project doesn't have a key-path file defined.
Could you look in Cloud Logging to find the corresponding permission denied entry? That will make it clear which principal is missing the bigquery.tables.get permission, and also let us see if the impersonation is happening correctly. It would be great if you could share the full log text here (after masking any sensitive information such as project IDs etc).
Hi @bharath-cv, on top of @dmedora's comment, can you provide more details, such as the versions of Airflow, Composer, DVT, and Python that you're using? Also, could you clarify if the BQ table and Composer are in the same GCP project?
Hi @chandantroughia and @bharath-cv! Have you been able to take a look at the Cloud Logging records, or do you have any updates on this case so we can continue to investigate it? If not, could you please try to do it within a week? Otherwise we'll consider closing the issue as "Can't reproduce/Stale". You can always reopen the issue, if needed. Thank you!
Hi @helensilva14
Please find the logs attached. The first log is from the DAG. The second one, named log2, is the info from the link below:
google.api_core.exceptions.Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/****-***-***/datasets/poa_de_hcb_dev/tables/cob_param?prettyPrint=false: Access Denied: Table **-**-:poa_de_hcb_dev.cob_param: Permission bigquery.tables.get denied on table **-**-:poa_de_hcb_dev.cob_param (or it may not exist).
Please take a look and let us know how we can use the DVT tool. Our use case here is to compare data between on-prem (Hive) and GCP.
Hi @bharath-cv,
Today, when you pass the impersonated credentials to the BigQueryResultHandler, it would only be used for writing the validation results to BigQuery. If your source or target system is BigQuery, it is currently not possible to pass impersonated credentials to the clients used to execute those queries. We have opened feature request #794 for that. So today, even if you pass impersonated creds to the result handler, the source and target queries are using default creds.
For now, the options are to either grant the appropriate permissions to the service account that's bound to Composer, or to provide a service account key file to be used by the job.
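For reference, the key-file route is a connection config like the optional field quoted earlier in this thread. A minimal sketch (the key names follow the sample quoted above, and the key path is just a placeholder for wherever you make the key available to the Composer workers, e.g. the environment's data/ folder):

project = "your-project"  # TODO: your project ID
BQ_CONN = {
    "source-type": "BigQuery",
    "project-id": project,
    # (Optional) BigQuery JSON Config File for On-Prem usecases
    "google-service-account-key-path": "/home/airflow/gcs/data/dvt-sa-key.json",  # placeholder path
}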
In your earlier comment https://github.com/GoogleCloudPlatform/professional-services-data-validator/issues/768#issuecomment-1466824114, you mentioned that the service account you're using for Composer already has BQ permissions, so we can try to debug that issue further.
Firstly, could you try running this DAG and let us know if it runs successfully? It runs a query using the BQ operator/client directly (instead of via DVT) so we can narrow down the cause of the issue. Note the TODOs indicating values you need to update.
from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {"retries": 0, "owner": "dvt_debug"}

with models.DAG(
    "example_bigquery_insert",
    schedule_interval="@once",
    default_args=default_args,
    start_date=datetime(2021, 1, 1),
    tags=["dvt_debug"],
) as dag:
    insert_job = BigQueryInsertJobOperator(
        task_id="insert_query_job",
        project_id="your-project",  # TODO: edit with your project ID
        configuration={
            "query": {
                "query": "SELECT * FROM `my-project-id.my-dataset.my-table`",  # TODO: edit with your query
                "useLegacySql": False,
            }
        },
    )

    insert_job
Once you've run it, could you also please search in Cloud Logging as requested in my last comment and share the full entry details. You can use this filter:
resource.type="bigquery_resource"
protoPayload.methodName="jobservice.insert"
protoPayload.authenticationInfo.principalEmail="YOUR-COMPOSER-SA@YOUR-PROJECT.iam.gserviceaccount.com"
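If it's easier to pull those entries programmatically than via the Console, a rough sketch using the google-cloud-logging client (assuming the library is installed and you have Logs Viewer access on the project):

from google.cloud import logging as cloud_logging  # pip install google-cloud-logging

LOG_FILTER = (
    'resource.type="bigquery_resource" '
    'protoPayload.methodName="jobservice.insert" '
    'protoPayload.authenticationInfo.principalEmail='
    '"YOUR-COMPOSER-SA@YOUR-PROJECT.iam.gserviceaccount.com"'
)

client = cloud_logging.Client(project="your-project")  # TODO: your project ID
for entry in client.list_entries(filter_=LOG_FILTER, order_by=cloud_logging.DESCENDING):
    print(entry.timestamp, entry.payload)  # most recent matching entry first
    break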
@bharath-cv Is this still an issue?
Hi @nehanene15 @helensilva14, I am facing the exact same issue.
I was able to run the insert query, but I had to pass the parameter impersonation_chain="YOUR-COMPOSER-SA@YOUR-PROJECT.iam.gserviceaccount.com" into the BigQueryInsertJobOperator function for a successful run.
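For reference, applied to the debug DAG shared above that change looks roughly like this (the service-account address and query are placeholders):

from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

# Same debug task as above, but running as the target service account via impersonation.
insert_job = BigQueryInsertJobOperator(
    task_id="insert_query_job",
    project_id="your-project",  # TODO: edit with your project ID
    impersonation_chain="YOUR-COMPOSER-SA@YOUR-PROJECT.iam.gserviceaccount.com",
    configuration={
        "query": {
            "query": "SELECT * FROM `my-project-id.my-dataset.my-table`",  # TODO: edit with your query
            "useLegacySql": False,
        }
    },
)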
I appreciate that a feature request was added. But is there any workaround for this issue for now?
Hi @mylar-pr, thanks for reaching out! Unfortunately, we do not have a workaround for this issue at the moment. Is it possible for you to execute DVT using other services, or is Composer your only option?
Hi Helen,
Thank you for the response. The purpose of DVT in my case is to perform data validation between BQ tables and Dataproc Hive tables. Could you please suggest any alternative services/solutions to accomplish this task?
Thank you,
Regards, Prajwal Mylar
To use Composer, the options are to either grant the appropriate permissions to the service account that's bound to Composer, or to provide a service account key file to be used by the job.
Alternatively, you can use Compute Engine, Cloud Run, or Cloud Functions, depending on the size of the validations you need to perform.
Hi, are there plans to support impersonation for running DVT inside an Airflow DAG? We are also in the same situation and want to use this library; impersonation would be really helpful. Thanks
@moizarafat Can you try the following in Airflow:
from google.auth import default, impersonated_credentials

default_creds, _ = default()  # Application Default Credentials available to the Airflow worker
target_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

impersonated_creds = impersonated_credentials.Credentials(
    source_credentials=default_creds,
    target_principal=SERVICE_ACCOUNT,  # the service account to impersonate
    target_scopes=target_scopes,
)
BQ_CONN = {"source_type": "BigQuery", "project_id": project, "credentials": impersonated_creds}
This should connect with the appropriate credentials.
When you create a BQ connection via the CLI, we take the SA key path, but then use google.oauth2.service_account.Credentials.from_service_account_file() to extract the Credentials object.
You should be able to pass that in directly and it will populate the BQ Client here.
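In other words, if you do have a key file, a sketch along these lines should also work (the key path is a placeholder):

from google.oauth2 import service_account

project = "your-project"  # TODO: your project ID
creds = service_account.Credentials.from_service_account_file("/path/to/key.json")  # placeholder path
BQ_CONN = {"source_type": "BigQuery", "project_id": project, "credentials": creds}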
Thanks @nehanene15, that solved it. We had to pass the same credentials to the result handler too, for the BQ load of the results.
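For anyone else landing here, the results-load side looks roughly like the sketch below, assuming your DVT version's BigQueryResultHandler.get_handler_for_project accepts a credentials argument (the import path may differ between versions):

from data_validation.result_handlers.bigquery import BigQueryResultHandler

# Sketch only: reuse the impersonated_creds built above so the load into the
# results table (pso_data_validator.results by default) runs as the impersonated SA.
result_handler = BigQueryResultHandler.get_handler_for_project(
    "your-project",  # TODO: project that holds the results dataset
    credentials=impersonated_creds,
)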
Hi Team,
I am trying to use this tool to validate some tables in BigQuery. I am using the code example mentioned here: https://github.com/GoogleCloudPlatform/professional-services-data-validator/blob/develop/samples/airflow/dvt_airflow_dag.py
I am getting the below mentioned error:
I am using the correct impersonated credentials. e.g.
Please help me figure out what I am doing wrong?