GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

What determines the Query API vs. Storage API to be used? #1298

Closed · whittid4 closed this 1 month ago

whittid4 commented 2 months ago

What is the determining factor that tells Spark to use the BigQuery Query API instead of the BigQuery Storage API?

Running the following uses the Storage API:

table = "bigquery-public-data.london_bicycles.cycle_hire"

df = (
    spark.read.format("bigquery")
    .option("table", table)
    .load()
)

display(df.where("start_date > '2021-01-01'"))

whereas running this one uses the BigQuery Query API:

table = "bigquery-public-data.london_bicycles.cycle_hire"

df = (
    spark.read.format("bigquery")
    .option("table", table)
    .load()
)

display(df.where("start_date > '2021-01-01'").count())

The second query obviously returns much less data since it is just a count, but I thought the .option("query", sql) (along with materializationDataset) is what made it use the BigQuery Query API, like this?

sql = "SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire`"

df = (
    spark.read.format("bigquery")
    .option("materializationProject", materialization_project)
    .option("materializationDataset", materialization_dataset)
    .option("query", sql)
    .load()
)

display(df)
anish97IND commented 1 month ago

+1 to this, because we are hitting a limit of the Storage Read API and getting the following error: INVALID_ARGUMENT: read_session.read_options.row_restriction exceeded maximum allowed length. Maximum bytes allowed: 1048576
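
One workaround we are considering (just a sketch reusing the query option shown above; not confirmed by the connector docs) is to express the filter in SQL via the query option, so the filtering runs as a BigQuery query job and only the materialized result is read through the Storage API, avoiding a huge pushed-down row_restriction:

sql = """
SELECT *
FROM `bigquery-public-data.london_bicycles.cycle_hire`
WHERE start_date > '2021-01-01'  -- filter runs in the query job, not as a Storage API row_restriction
"""

df = (
    spark.read.format("bigquery")
    .option("materializationProject", materialization_project)   # assumes these variables are already defined
    .option("materializationDataset", materialization_dataset)
    .option("query", sql)
    .load()
)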

davidrabinowitz commented 1 month ago

This is how the query and storage APIs are used:

The best way to know in retrospect which API was used is to check the driver's log: the Storage API is identified by the Created ReadSession lines, and query jobs by the QueryJobConfiguration lines.
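
For example, a minimal sketch of checking this from PySpark (the exact log lines depend on the connector version; setLogLevel only adjusts the Spark-side log threshold on the driver):

# Raise driver log verbosity so the connector's read-session / query-job lines are visible,
# then look for "Created ReadSession" (Storage API) or "QueryJobConfiguration" (Query API)
# in the driver's log output.
spark.sparkContext.setLogLevel("INFO")

df = (
    spark.read.format("bigquery")
    .option("table", "bigquery-public-data.london_bicycles.cycle_hire")
    .load()
)
df.where("start_date > '2021-01-01'").count()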