GoogleCloudDataproc / hive-bigquery-connector

A library enabling BigQuery as Hive storage handler
Apache License 2.0

[README.md] Missing documentation on IAM permissions required to run in Dataproc #102

Open eyalbenivri opened 1 year ago

eyalbenivri commented 1 year ago

When setting up a Dataproc cluster with the default service account, creating a table fails with the following error:

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. com.google.cloud.hive.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Access Denied: Table eyalbenivri-playground:hivetest.nflplayers: Permission bigquery.tables.get denied on table eyalbenivri-playground:hivetest.nflplayers (or it may not exist).

After adding the BigQuery Data Editor role to the service account, the table creation succeeds, but selecting from the table still fails:

hive> select * from  nflplayers limit 10;
OK
Failed with exception java.io.IOException:com.google.cloud.hive.bigquery.repackaged.com.google.api.gax.rpc.PermissionDeniedException: com.google.cloud.hive.bigquery.repackaged.io.grpc.StatusRuntimeException: PERMISSION_DENIED: request failed: the user does not have 'bigquery.readsessions.create' permission for 'projects/eyalbenivri-playground'

I worked around the issue by giving the service account the BigQuery Admin role, which works but is over-privileged and not recommended. Documenting the minimum list of roles to assign to the Dataproc service account would help users set up clusters following best practices.
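For readers hitting the same errors, the grants discussed in this thread can be scripted. The sketch below is a hypothetical helper that prints the corresponding gcloud commands; the role list is an assumption pieced together from the errors above, not an officially documented minimum:

```python
# Hypothetical helper: emits gcloud IAM binding commands for the roles
# that came up in this thread. The role list is an assumption, not an
# official minimum-permissions list for the connector.
MINIMAL_ROLES = [
    "roles/bigquery.dataEditor",       # covers bigquery.tables.get / table creation
    "roles/bigquery.readSessionUser",  # covers bigquery.readsessions.create
]

def grant_commands(project_id: str, service_account: str) -> list[str]:
    """Build one project-level gcloud IAM binding command per role."""
    return [
        f"gcloud projects add-iam-policy-binding {project_id} "
        f"--member='serviceAccount:{service_account}' --role='{role}'"
        for role in MINIMAL_ROLES
    ]

for cmd in grant_commands("my-project", "my-sa@my-project.iam.gserviceaccount.com"):
    print(cmd)
```

Project-level bindings are shown for simplicity; dataset-level grants are tighter where they suffice.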

jphalip commented 1 year ago

Thanks for the report. Could you try giving it the roles/bigquery.readSessionUser role instead of BQ Admin and see if that gives it enough permissions?

sharmavarun1108 commented 4 months ago

I need help with the following permission issue.

Could you please list all of the permissions needed to make this work, besides roles/bigquery.readSessionUser?

Error:

py4j.protocol.Py4JJavaError: An error occurred while calling o129.load.
: com.google.cloud.bigquery.connector.common.BigQueryConnectorException: Error creating destination table using the following query: [SELECT * FROM url360.domain_attr_bq_ext WHERE DATE(dt) = DATE('2024-05-03')]
   at com.google.cloud.bigquery.connector.common.BigQueryClient.materializeTable(BigQueryClient.java:491)
   at com.google.cloud.bigquery.connector.common.BigQueryClient.materializeQueryToTable(BigQueryClient.java:431)
   at com.google.cloud.bigquery.connector.common.BigQueryClient.getReadTable(BigQueryClient.java:250)
   at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:77)
   at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:46)
   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
   at scala.Option.getOrElse(Option.scala:189)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:171)
   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   at py4j.Gateway.invoke(Gateway.java:282)
   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   at py4j.commands.CallCommand.execute(CallCommand.java:79)
   at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
   at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.common.util.concurrent.UncheckedExecutionException: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Access Denied: Table blah-p-myapps:url360.domain_attr_bq_ext: User does not have permission to query table blah-p-myapps:url360.domain_attr_bq_ext, or perhaps it does not exist in location US.
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2055)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache.get(LocalCache.java:3966)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4863)
   at com.google.cloud.bigquery.connector.common.BigQueryClient.materializeTable(BigQueryClient.java:479)
   ... 22 more
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryException: Access Denied: Table blah-p-myapps:url360.domain_attr_bq_ext: User does not have permission to query table blah-p-myapps:url360.domain_attr_bq_ext, or perhaps it does not exist in location US.
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.translate(HttpBigQueryRpc.java:115)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.spi.v2.HttpBigQueryRpc.getQueryResults(HttpBigQueryRpc.java:694)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$36.call(BigQueryImpl.java:1410)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl$36.call(BigQueryImpl.java:1405)
   at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryRetryHelper.run(BigQueryRetryHelper.java:86)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryRetryHelper.runWithRetries(BigQueryRetryHelper.java:49)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1404)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryImpl.getQueryResults(BigQueryImpl.java:1388)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job$1.call(Job.java:338)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job$1.call(Job.java:335)
   at com.google.cloud.spark.bigquery.repackaged.com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:103)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryRetryHelper.run(BigQueryRetryHelper.java:86)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.BigQueryRetryHelper.runWithRetries(BigQueryRetryHelper.java:49)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job.waitForQueryResults(Job.java:334)
   at com.google.cloud.spark.bigquery.repackaged.com.google.cloud.bigquery.Job.waitFor(Job.java:244)
   at com.google.cloud.bigquery.connector.common.BigQueryClient$DestinationTableBuilder.waitForJob(BigQueryClient.java:691)
   at com.google.cloud.bigquery.connector.common.BigQueryClient$DestinationTableBuilder.createTableFromQuery(BigQueryClient.java:675)
   at com.google.cloud.bigquery.connector.common.BigQueryClient$DestinationTableBuilder.call(BigQueryClient.java:662)
   at com.google.cloud.bigquery.connector.common.BigQueryClient$DestinationTableBuilder.call(BigQueryClient.java:637)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4868)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3533)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2282)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2159)
   at com.google.cloud.spark.bigquery.repackaged.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2049)
   ... 25 more
Caused by: com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException: 403 Forbidden
GET https://www.googleapis.com/bigquery/v2/projects/blah-p-myapps/queries/b0491fe9-b1e7-4190-a288-14cc6ddc0a95?location=US&maxResults=0&prettyPrint=false
{
  "code" : 403,
  "errors" : [ {
    "domain" : "global",
    "message" : "Access Denied: Table blah-p-myapps:url360.domain_attr_bq_ext: User does not have permission to query table blah-p-myapps:url360.domain_attr_bq_ext, or perhaps it does not exist in location US.",
    "reason" : "accessDenied"
  } ],
  "message" : "Access Denied: Table blah-p-myapps:url360.domain_attr_bq_ext: User does not have permission to query table blah-p-myapps:url360.domain_attr_bq_ext, or perhaps it does not exist in location US.",
  "status" : "PERMISSION_DENIED"
}
   at com.google.cloud.spark.bigquery.repackaged.com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:146)
jphalip commented 4 months ago

@sharmavarun1108 It looks like you are running Spark SQL. Is that correct? Could you provide a bit more context about the environment? For example, are you running your code in Dataproc or in a self-managed Hadoop/Hive cluster? Are you able to share any code snippets? Thanks.

jphalip commented 4 months ago

@sharmavarun1108 One more thing. By chance are you reading a BigQuery view? If so, you'll also need to give your service account permissions to create tables. This is because the connector needs to copy data from the view to a regular table in order to read it. See more details here: https://github.com/GoogleCloudDataproc/hive-bigquery-connector?tab=readme-ov-file#reading-from-bigquery-views-and-materialized-views
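The distinction above (plain tables vs. views that must be materialized into temporary tables) can be captured in a small helper. This is only a sketch of the logic described in the comment; the role names are assumptions drawn from this thread, not an authoritative mapping:

```python
def roles_for_read(reads_views: bool) -> set[str]:
    """Sketch of roles suggested for reading through the connector.

    Assumption based on this thread: reading a view makes the connector
    materialize it into a temporary table, which additionally requires
    table-creation permission (e.g. via BigQuery Data Editor).
    """
    roles = {"roles/bigquery.dataViewer", "roles/bigquery.readSessionUser"}
    if reads_views:
        # Materializing a view into a temp table needs create/write access.
        roles.add("roles/bigquery.dataEditor")
    return roles
```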

Please let us know if that works.

sharmavarun1108 commented 4 months ago

Hi @jphalip, thank you for your prompt response. I believe I have identified the root cause of the issue.

To access a Hive external table backed by BQ, I used the spark.read.format("bigquery") method, which is designed for reading BQ tables directly, as outlined in the documentation here: https://github.com/GoogleCloudDataproc/spark-bigquery-connector/

However, I encountered significant delays in data retrieval when using this approach, particularly when accessing BQ data through Hive running on Dataproc 2.1. The prolonged wait times, sometimes exacerbated by BQ slot contention, resulted in excessive billing costs for idle Spark clusters waiting on BQ to return data.

Regrettably, due to these challenges, it appears impractical to proceed with building a datalake using this approach within our company.

As a workaround, we will revert to the less-than-ideal solution of copying BQ tables into Hive tables in ORC format stored in GCS. This method, executed via Hive/Tez, avoids the BQ slot contention issues and enables smoother data copying operations.
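For anyone considering the same workaround, the copy step amounts to a CREATE TABLE ... AS SELECT over the BigQuery-backed external table. The helper below builds that HiveQL as a string; the table names and GCS location are placeholders, and running the statement via Hive/Tez is assumed to behave as described above:

```python
def build_orc_copy_ddl(bq_external_table: str, orc_table: str, gcs_location: str) -> str:
    """Build HiveQL that copies a BigQuery-backed external table into an
    ORC table stored in GCS. All names/locations here are placeholders."""
    return (
        f"CREATE TABLE {orc_table}\n"
        f"STORED AS ORC\n"
        f"LOCATION '{gcs_location}'\n"
        f"AS SELECT * FROM {bq_external_table}"
    )

print(build_orc_copy_ddl(
    "url360.domain_attr_bq_ext",
    "url360.domain_attr_orc",
    "gs://my-bucket/warehouse/domain_attr_orc",
))
```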

Thank you for your understanding, and I appreciate your guidance on next steps.

jphalip commented 4 months ago

Hi @sharmavarun1108. Thanks for the feedback. I'd love to learn more about your use case and see if we can help mitigate the issues you've run into. Please get in touch (jphalip@google.com) and we could set up a quick chat if you're interested. Thanks!