GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Does this connector Support Reading data from external table? #365

Closed arunkindra closed 3 years ago

arunkindra commented 3 years ago

I am trying to read an external table using this connector, and I am getting the issue below. May I know if there is any plan to support this in the near future?

Exception in thread "main" java.lang.UnsupportedOperationException: The type of table <table-name> is currently not supported: EXTERNAL
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelationInternal(BigQueryRelationProvider.scala:88)
    at com.google.cloud.spark.bigquery.BigQueryRelationProvider.createRelation(BigQueryRelationProvider.scala:45)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:332)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:242)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:230)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
    at com.gcp.poc.sample.SparkBigQueryConnector.readBigQueryDataset(SparkBigQueryConnector.java:74)
    at com.gcp.poc.sample.SparkBigQueryConnector.main(SparkBigQueryConnector.java:44)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Dependency used

com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.3
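For context, a minimal read along these lines reproduces the error; the project, dataset, and table names below are placeholders, not the reporter's actual values:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkBigQueryExternalTableRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("bigquery-external-table-repro")
        .getOrCreate();

    // Loading an EXTERNAL table through the connector fails with
    // UnsupportedOperationException, because the Storage Read API
    // backend only serves managed (native) BigQuery tables.
    Dataset<Row> df = spark.read()
        .format("bigquery")
        .option("table", "my-project.my_dataset.my_external_table") // placeholder
        .load();

    df.show();
  }
}
```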
emkornfield commented 3 years ago

The backend (Storage Read API) service does not support reading from external tables at this time.

arunkindra commented 3 years ago

Hi @emkornfield, is there any plan to add this support in the near future?

emkornfield commented 3 years ago

It is something we are considering for our roadmap. To help track demand for it, you can open a feature request in the BigQuery issue tracker.

davidrabinowitz commented 3 years ago

Duplicate of #255

saipraveenpn commented 2 years ago

Hello @davidrabinowitz, I am facing a different error while trying to query an external BQ table. Is there any plan to support this in the future?

Caused by: java.lang.UnsupportedOperationException
    at com.google.cloud.spark.bigquery.ArrowSchemaConverter$ArrowVectorAccessor.getUTF8String(ArrowSchemaConverter.java:313)
    at com.google.cloud.spark.bigquery.ArrowSchemaConverter.getUTF8String(ArrowSchemaConverter.java:120)
    at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getUTF8String(MutableColumnarRow.java:135)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:636)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:414)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

davidrabinowitz commented 2 years ago

Is the table part of a BigLake?

saipraveenpn commented 2 years ago

No David, this is a regular BQ external table over ORC files on GCS. Just curious about support for it; we'd probably read the ORC files directly, or through a Hive external table with Spark.

Btw, thanks for the mention of BigLake; excited to see Google heading toward the lakehouse!
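The fallback mentioned above can be sketched as follows; the bucket path is a placeholder, and this assumes the GCS connector is on the Spark classpath so that `gs://` URIs resolve:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrcDirectReadSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("orc-direct-read")
        .getOrCreate();

    // Bypass BigQuery entirely and read the ORC files that back the
    // external table straight from GCS with Spark's built-in ORC reader.
    Dataset<Row> orc = spark.read()
        .orc("gs://my-bucket/path/to/orc/"); // placeholder path

    orc.createOrReplaceTempView("external_orc");
    spark.sql("SELECT COUNT(*) FROM external_orc").show();
  }
}
```

This reads the data without the Storage Read API, at the cost of losing the external table's BigQuery-level schema and metadata.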

emkornfield commented 2 years ago

The API only supports BigLake tables at this point; support for regular external tables is not something we are likely to add soon.