GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

BigQuery Storage API always returning 200 partitions #1237

Closed: rcanzanese closed this issue 2 weeks ago

rcanzanese commented 1 month ago

I'm setting preferredMinParallelism and maxParallelism, and the settings are being picked up, but no matter what I do I always end up with 200 partitions, regardless of how big the underlying table is -- I've tried tables as large as 4 TiB with the same result.

spark:spark.datasource.bigquery.preferredMinParallelism: "33333"
spark:spark.datasource.bigquery.maxParallelism: "33333"

With these settings, the message I receive is:

Requested 33333 max partitions, but only received 200 from the BigQuery Storage API for session 

Is there some additional config that I am missing?
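For context, a minimal sketch of how these same hints can be passed per-read via DataFrameReader options instead of cluster properties; the table name is a placeholder, and the connector jar is assumed to be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

object ReadWithParallelismHints {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bq-parallelism-demo")
      .getOrCreate()

    val df = spark.read
      .format("bigquery")
      // Ask the BigQuery Storage API for at least this many streams
      .option("preferredMinParallelism", "33333")
      // Upper bound on the number of streams the connector will request
      .option("maxParallelism", "33333")
      .load("my-project.my_dataset.my_table") // placeholder table

    df.printSchema()
  }
}
```

Both forms set the same connector options; the per-read form just scopes them to a single load.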

isha97 commented 2 weeks ago

Hi @rcanzanese ,
The actual number of partitions may be less than preferredMinParallelism if BigQuery deems the data small enough. There are also quotas on the number of partitions per read session, which restrict the parallelism. Please file a bug with support to increase the quota for your project.
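For reference, a minimal sketch of checking what the read session actually granted, and of a client-side workaround via repartitioning when more downstream tasks are needed; the table name and target partition count are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object CheckReadSessionPartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bq-partition-check")
      .getOrCreate()

    val df = spark.read
      .format("bigquery")
      .load("my-project.my_dataset.my_table") // placeholder table

    // Reports what the read session actually produced (e.g. 200),
    // independent of the preferredMinParallelism/maxParallelism hints.
    println(s"read session partitions: ${df.rdd.getNumPartitions}")

    // A shuffle can raise downstream task parallelism even when the
    // Storage API caps the stream count for the session.
    val widened = df.repartition(1000) // placeholder target count
    println(s"after repartition: ${widened.rdd.getNumPartitions}")
  }
}
```

Note the repartition incurs a full shuffle, so it only pays off when the extra parallelism matters for the work that follows the read.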