What determines the Query API vs. Storage API to be used?

whittid4 commented 2 months ago

What is the determining factor which tells spark to use the BigQuery Query API over the BigQuery Storage API?

Running the following uses the Storage API:

table = "bigquery-public-data.london_bicycles.cycle_hire"

df = (
    spark.read.format("bigquery")
    .option("table", table)
    .load()
)

display(df.where("start_date > '2021-01-01'"))

whereas running this one uses the BigQuery API:

table = "bigquery-public-data.london_bicycles.cycle_hire"

df = (
    spark.read.format("bigquery")
    .option("table", table)
    .load()
)

display(df.where("start_date > '2021-01-01'").count())

The second query is obviously much smaller as it is doing a count, but I though the .option("query", sql) (along with MATERIALIZED_DATASET) is what made it use the BigQuery Query API, like this?

sql = "SELECT * FROM `bigquery-public-data.london_bicycles.cycle_hire`"

df = (
    spark.read.format("bigquery")
    .option("materializationProject", materialization_project)
    .option("materializationDataset", materialization_dataset)
    .option("query", sql)
    .load()
)

display(df)

anish97IND commented 1 month ago

+1 to this , because we hitting a limit of Storage Read API and getting following error - INVALID_ARGUMENT: read_session.read_options.row_restriction exceeded maximum allowed length. Maximum bytes allowed: 1048576

davidrabinowitz commented 1 month ago

This is how the query and storage APIs are used:

Every time a data is read to a data frame, it is done by the storage API, as it allows us to read data in a distributed manner. This let Spark read each partition independently on different executor.
When reading from a query or a view, a query job is created. The query saves its results into a short lived table (whose data is read by the storage API)
When .count() is called and there's a row filter, a SELECT COUNT(*) FROM <table> WHERE <condition> query is executed, to speed up calculating the count.

The best way to know in retrospective which API has been used is to check the driver's log - the Storage API is identified by the Created ReadSession lines, the query jobs by the QueryJobConfiguration lines.

GoogleCloudDataproc / spark-bigquery-connector

What determines the Query API vs. Storage API to be used? #1298