+1 to this, because we're hitting a limit of the Storage Read API and getting the following error: `INVALID_ARGUMENT: read_session.read_options.row_restriction exceeded maximum allowed length. Maximum bytes allowed: 1048576`
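For context, a minimal sketch of how a read can hit that cap, assuming the connector pushes the predicate down into `read_session.read_options.row_restriction` (table and column names below are placeholders, not from this thread):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("bq-row-restriction").getOrCreate()

// A very large IN list; the point is the size of the pushed-down filter.
val ids = (1 to 200000).map(_.toString)

val df = spark.read
  .format("bigquery")
  .option("table", "my-project.my_dataset.my_table")
  .load()
  // The connector compiles this predicate into the read session's
  // row_restriction; once the serialized restriction grows past
  // 1048576 bytes, the Storage Read API rejects it with the
  // INVALID_ARGUMENT error quoted above.
  .where(col("id").isin(ids: _*))
```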
This is how the query and storage APIs are used: in case `.count()` is called and there's a row filter, a `SELECT COUNT(*) FROM <table> WHERE <condition>` query is executed to speed up calculating the count. The best way to know in retrospect which API has been used is to check the driver's log - the Storage API is identified by the `Created ReadSession` lines, the query jobs by the `QueryJobConfiguration` lines.
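A minimal sketch of the `.count()`-with-filter case described above (the public table and filter are placeholders chosen for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bq-count-pushdown").getOrCreate()

val df = spark.read
  .format("bigquery")
  .option("table", "bigquery-public-data.samples.shakespeare")
  .load()
  .where("word_count > 100") // row filter the connector can push down

// With a filter present, the connector runs a
//   SELECT COUNT(*) FROM <table> WHERE word_count > 100
// query job (look for QueryJobConfiguration in the driver log)
// instead of streaming rows through the Storage Read API.
val n = df.count()
println(s"matching rows: $n")
```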
What is the determining factor which tells Spark to use the BigQuery Query API over the BigQuery Storage API? Running the following uses the Storage API, whereas running this one uses the BigQuery Query API. The second query is obviously much smaller as it is doing a `count`, but I thought the `.option("query", sql)` (along with `MATERIALIZED_DATASET`) was what made it use the BigQuery Query API, like this?
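The snippets referred to above weren't preserved, but a sketch of what the two reads might look like (project, dataset, and table names are placeholders): the first streams the table directly through the Storage Read API, the second pushes SQL through a query job that is materialized into a temporary table.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bq-apis").getOrCreate()

// 1) Direct table read: the driver log shows "Created ReadSession",
//    i.e. the Storage Read API is used.
val direct = spark.read
  .format("bigquery")
  .option("table", "my-project.my_dataset.my_table")
  .load()
direct.show()

// 2) SQL via the query option: the connector first runs the query as a
//    job (QueryJobConfiguration in the log) and materializes the result
//    into materializationDataset; both settings below are required for
//    the query option.
spark.conf.set("viewsEnabled", "true")
spark.conf.set("materializationDataset", "my_materialized_dataset")

val counted = spark.read
  .format("bigquery")
  .option("query", "SELECT COUNT(*) AS n FROM my_dataset.my_table")
  .load()
counted.show()
```

As I understand it, even with the `query` option the final data transfer still goes through the Storage Read API; the Query API only runs the SQL and materializes the result, which is why both kinds of log lines can appear for one read.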