GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Storage Read API logging #1204

Closed · kristopherkane closed 2 months ago

kristopherkane commented 3 months ago

Howdy all.

Are there any debug logging options available to get duration timing of a .load() read session on connector version 0.28.1?
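
To make the question concrete, this is the kind of thing I have been poking at; a rough sketch assuming Spark 3.3+ (which ships log4j 2), with the logger package names guessed from the class names that show up in the driver log further down:

```scala
import org.apache.logging.log4j.Level
import org.apache.logging.log4j.core.config.Configurator

// Raise the connector's loggers to DEBUG at runtime. The package names
// below are guesses based on the classes in the driver log
// (DirectBigQueryRelation, ReadSessionCreator); adjust for your version.
Configurator.setLevel("com.google.cloud.spark.bigquery", Level.DEBUG)
Configurator.setLevel("com.google.cloud.bigquery.connector.common", Level.DEBUG)
```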

davidrabinowitz commented 3 months ago

Can you please elaborate on your requirements?

kristopherkane commented 3 months ago

We are diagnosing variable job runtimes and are looking at a Spark job that does a large read from BigQuery. It is difficult to tell how long the BigQuery read takes in isolation, since the stage containing the read might also include something like a broadcast join, so the plan view in the Spark History UI doesn't always represent just the BQ portion. One workaround we've been sketching is shown below.
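
This is only an idea, not something the connector provides, and the table name is a placeholder: persist and count the DataFrame right after .load() so the BigQuery scan runs as its own Spark job, which makes its runtime visible separately from the join:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("bq-read-timing").getOrCreate()

// Placeholder table name; the real job reads a much larger table.
val bq = spark.read
  .format("bigquery")
  .option("table", "my-project.my_dataset.my_table")
  .load()
  .persist(StorageLevel.MEMORY_AND_DISK)

// count() forces the scan in a job of its own, so the Spark History UI
// (and simple wall-clock timing) shows the BQ read without the broadcast
// join folded into the same stage.
val t0 = System.nanoTime()
val rows = bq.count()
println(f"BigQuery read: $rows rows in ${(System.nanoTime() - t0) / 1e9}%.2f s")
```

The drawback is that persisting changes the job we are trying to measure, so it is only a diagnostic step, not something we would leave in production.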

Correction on the connector version, it is 0.30.

The Spark driver logs this by default and I was looking for some other options.

24/03/27 03:31:44 INFO DirectBigQueryRelation: |Querying table xyz.123, parameters sent from Spark:|requiredColumns=[<column>],|filters=[]
24/03/27 03:31:46 INFO ReadSessionCreator: Read session:{"readSessionName":"projects/xyz","readSessionCreationStartTime":"2024-03-27T03:31:44.470062Z","readSessionCreationEndTime":"2024-03-27T03:31:46.047985Z","readSessionPrepDuration":740,"readSessionCreationDuration":837,"readSessionDuration":1577}
24/03/27 03:31:46 INFO ReadSessionCreator: Requested 20000 max partitions, but only received 2 from the BigQuery Storage API for session xyz.123. Notice that the number of streams in actual may be lower than the requested number, depending on the amount parallelism that is reasonable for the table and the maximum amount of parallelism allowed by the system.
24/03/27 03:31:46 INFO BigQueryRDDFactory: Created read session for table 'xyz.123': xyz.123

I don't think readSessionDuration represents the actual time spent retrieving data from BQ. From the log above, readSessionPrepDuration (740 ms) + readSessionCreationDuration (837 ms) = readSessionDuration (1577 ms), so it appears to cover only read-session creation on the driver, not the per-stream data transfer done by the executors. Looks like there has been a lot of work around this recently.

davidrabinowitz commented 3 months ago

Are you using filters? Can you please upgrade to version 0.37.0? Also, switching to the latest flavor of the connector (spark-3.x-bigquery) may help.

kristopherkane commented 3 months ago

Some queries use filters, perhaps most?

A BQ upgrade is on the horizon; I think there is a breaking decimal change sometime after 0.30 that I haven't looked at closely yet.

Just to be clear, there's nothing on 0.30 that I can set to DEBUG in a logging config to get more activity timing?

kristopherkane commented 3 months ago

https://github.com/GoogleCloudDataproc/spark-bigquery-connector?tab=readme-ov-file#connector-metrics-and-how-to-view-them

Looks pretty good if we can get there.
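
If anyone else lands here, a rough sketch of dumping connector task metrics with a plain SparkListener. The name filter is an assumption on my part; the actual metric names depend on the connector version and are described in the README section linked above:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Assumes an active SparkSession named `spark`. Prints any task-level
// accumulables whose name mentions "bigquery"; the name filter is a guess,
// so check the README section above for the metric names your connector
// version actually registers.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.taskInfo.accumulables
      .filter(_.name.exists(_.toLowerCase.contains("bigquery")))
      .foreach(a => println(s"stage=${taskEnd.stageId} ${a.name.get}=${a.value.getOrElse("")}"))
  }
})
```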