I am writing a PySpark script on GCP Dataproc to:
- read data from GCS (using the credentials of the impersonated SA)
- write data to BigQuery (using the credentials of the impersonated SA)
The Dataproc cluster's service account has the Service Account Token Creator role (roles/iam.serviceAccountTokenCreator) on the service account to be impersonated (delegated_sa), and delegated_sa has access to GCS and BigQuery.
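To double-check the IAM side, the impersonation chain can be verified from the driver with google-auth; a minimal sketch, assuming google-auth is installed on the cluster and that delegated_sa holds the target account's email (the scope and lifetime values are illustrative, not taken from my job):

import google.auth
from google.auth import impersonated_credentials
from google.auth.transport.requests import Request

delegated_sa = "delegated-sa@my-project.iam.gserviceaccount.com"  # hypothetical email

# On Dataproc, default credentials resolve to the cluster's service account.
source_credentials, _ = google.auth.default()

# Minting a short-lived token for delegated_sa only succeeds if the cluster's
# service account is allowed to create tokens for it.
target_credentials = impersonated_credentials.Credentials(
    source_credentials=source_credentials,
    target_principal=delegated_sa,
    target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
    lifetime=300,
)
target_credentials.refresh(Request())
print("Impersonated token acquired:", bool(target_credentials.token))

This only checks token minting, not the BigQuery permissions of delegated_sa itself.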
Script
...
# fs.gs.auth.impersonation.service.account -> impersonation for the GCS connector
# gcpImpersonationServiceAccount -> impersonation for the BigQuery connector
spark = SparkSession.builder \
    .appName("Read CSV from GCS and Write to BigQuery") \
    .config('spark.hadoop.fs.gs.auth.impersonation.service.account', delegated_sa) \
    .config('gcpImpersonationServiceAccount', delegated_sa) \
    .getOrCreate()
...
# Read the CSV from GCS (this works with impersonation; see Note below)
data = spark.read.format("csv") \
    .schema(schema) \
    .load(csv)
...
# Write to BigQuery (this is where the 403 below is raised)
data.write.format('bigquery') \
    .option('table', dataset_table) \
    .mode('append') \
    .save()
...
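For completeness, my understanding is that spark-bigquery-connector options can also be passed on the reader/writer itself rather than on the session builder; a sketch of that variant, assuming gcpImpersonationServiceAccount is honored as a per-write option (not verified):

# Variant: pass the impersonation target directly as a write option
# (assumption: the connector picks it up here as well as from the Spark conf).
data.write.format('bigquery') \
    .option('table', dataset_table) \
    .option('gcpImpersonationServiceAccount', delegated_sa) \
    .mode('append') \
    .save()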
Error
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
GET https://www.googleapis.com/bigquery/v2/projects/dataproc/datasets/tables/customers?prettyPrint=false
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Table dataproc:customers: Permission bigquery.tables.get denied on table dataproc:customers (or it may not exist).",
    "reason" : "accessDenied"
  } ],
  "message" : "Access Denied: Table dataproc:dataproc_out.customers: Permission bigquery.tables.get denied on table dataproc:customers (or it may not exist).",
  "status" : "PERMISSION_DENIED"
}
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:118)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:37)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:428)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1111)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getTable(HttpBigQueryRpc.java:284)
... 45 more
Note
- The service account impersonation for GCS works properly.
- If I use the service account's JSON key instead of impersonation (.config("credentials", BASE64)), everything works properly (sketch below).
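For reference, this is roughly the working key-based variant; a minimal sketch, assuming the delegated_sa key file sits at a hypothetical local path sa-key.json and is passed base64-encoded via the connector's credentials option, as in my test:

import base64
from pyspark.sql import SparkSession

# Hypothetical path to delegated_sa's JSON key; it is passed base64-encoded.
with open("sa-key.json", "rb") as f:
    key_b64 = base64.b64encode(f.read()).decode("utf-8")

spark = SparkSession.builder \
    .appName("Read CSV from GCS and Write to BigQuery") \
    .config("credentials", key_b64) \
    .getOrCreate()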