GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

BigQuery full table read: some partitions show a much longer bqScanTime than other partitions with a similar number of records #1262

Closed josepaul1988 closed 4 months ago

josepaul1988 commented 4 months ago

I am trying to read a large BigQuery table with 95 million records and 400+ columns. I am using bqChannelPoolSize=15, with 6 executors of 11 cores each. I can see from the Spark UI that 200 tasks are created for the read session, with more or less equal data (roughly 500K records each on average). However, the actual read for certain partitions takes much longer (many times the execution time of other, similar partitions). Because of these few slow tasks, my job runs beyond 6 hours and fails with a read session timed out error. I checked GCP Logging with the read session ID, but found nothing unusual; in fact, no entries were present for that time frame for that session ID.

How can I debug this further? Are there any table-level optimizations I can apply for full table reads? The same thing happens for other queries where we get just 4 partitions: 3 partitions finish in under 3 minutes, while the last one takes 3 hours to read the same number of records as the others.

Technical details:

- PySpark: 3.0.2, on-premise cluster on K8s
- BigQuery Spark connector: spark-3.1-bigquery-0.36.1
- bqChannelPoolSize=15
- maxParallelism received: 200
- Spark conf: 6 executors with 11 cores each
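For reference, the connector options mentioned above would typically be passed to the DataFrame reader. A minimal sketch, assuming the options dict below mirrors the reporter's setup; the table name in the commented usage line is a placeholder:

```python
# Hedged sketch: read options discussed in this issue, as they would be
# handed to spark.read.format("bigquery"). Values mirror the reporter's
# configuration; they are not recommendations.
read_options = {
    "bqChannelPoolSize": "15",  # size of the shared gRPC channel pool
    "maxParallelism": "200",    # upper bound on read streams / partitions
}

# With a live SparkSession, usage would look like:
# df = (spark.read.format("bigquery")
#           .options(**read_options)
#           .load("dataset.table"))  # placeholder table name
print(read_options["maxParallelism"])
```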

kmjung commented 4 months ago

Can you share session IDs for example sessions? These should be in the Spark driver logs, and are strings of the format "projects/.../locations/.../sessions/...".
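Since the session IDs follow a fixed path-like format, they can be pulled out of driver log text mechanically. A small sketch, assuming the pattern kmjung describes; the sample log line is invented for illustration:

```python
import re

# Matches read session IDs of the form
# "projects/.../locations/.../sessions/..." in Spark driver logs.
SESSION_RE = re.compile(r"projects/[\w.-]+/locations/[\w-]+/sessions/[\w-]+")

def extract_session_ids(log_text):
    """Return the unique read-session IDs found in a block of log text."""
    return sorted(set(SESSION_RE.findall(log_text)))

# Invented sample log line for illustration only.
sample = ("INFO ReadSessionCreator: created session "
          "projects/my-proj/locations/us/sessions/CAFE123")
print(extract_session_ids(sample))
# → ['projects/my-proj/locations/us/sessions/CAFE123']
```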

josepaul1988 commented 4 months ago

@kmjung: Unfortunately I cannot share that info, as the issue involves production BigQuery tables and may require extra approvals. Is there any other way I can debug the issue further?

kmjung commented 4 months ago

It's going to be tough for me to debug without any additional logs here. If you're open to filing a support case with GCP, that might be the easiest way to share this information securely with relevant experts from the BigQuery and Dataproc teams.

josepaul1988 commented 4 months ago

Thank you @kmjung. I will do that.