GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

BigQuery changes not reflecting in the Streaming Pipeline #968

Closed: nishant-neo closed this issue 4 months ago

nishant-neo commented 1 year ago

I have a Spark Structured Streaming pipeline that is triggered from a Kafka queue and enriches these events with a lookup against an underlying BigQuery table (bqt). This table is itself being updated continuously by a Dataflow job.

Let's say that when I deploy the above pipeline at time=t, there were 100 rows in bqt. At time=t', the number of rows in bqt has increased to 120, but my streaming pipeline still sees 100 rows. Once I redeploy the pipeline after time=t', it starts seeing 120 rows.

Is this behavior expected?

(The whole point of using BigQuery over files was to read the data dynamically.)
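For context, a minimal PySpark sketch of this kind of pipeline; the project/table id, Kafka broker, topic, and join key are illustrative placeholders, not the actual job:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-bq-enrichment")
         .getOrCreate())

# Static lookup read from BigQuery ("bqt"); this DataFrame is built once at deploy time.
bqt = (spark.read.format("bigquery")
       .option("table", "my_project.my_dataset.bqt")   # hypothetical table id
       .load())

# Streaming source: events arriving on a Kafka topic.
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   # hypothetical broker
          .option("subscribe", "events")                       # hypothetical topic
          .load()
          .selectExpr("CAST(key AS STRING) AS event_key",
                      "CAST(value AS STRING) AS payload"))

# Enrichment: stream-static join of each micro-batch against the BigQuery lookup
# (assumes bqt has a string column named `key`; hypothetical).
enriched = events.join(bqt, events.event_key == bqt.key, "left")

query = (enriched.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()
```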

davidrabinowitz commented 1 year ago

Yes, this behavior is expected. When a DataFrame is created, it is tied to a BigQuery Storage ReadSession, which reads from a snapshot of the table taken when the session is created.

How often do you need the data to be refreshed? After each period, do you care for the entire table or just for the added rows?
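To illustrate the snapshot behavior described above, a hedged sketch (table id is hypothetical; `spark` is an existing SparkSession):

```python
# The DataFrame stays bound to the snapshot of its ReadSession, so repeated
# actions keep returning the data as it looked when the read was created.
lookup = (spark.read.format("bigquery")
          .option("table", "my_project.my_dataset.bqt")   # hypothetical table id
          .load())

print(lookup.count())   # e.g. 100 rows at time t
# ... new rows land in the table (e.g. via the Dataflow job) ...
print(lookup.count())   # still reflects the original snapshot, not the new rows
```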

nishant-neo commented 1 year ago

> When a DataFrame is created, it is tied to a BigQuery Storage ReadSession, which reads from a snapshot of the table taken when the session is created.

How can I reload this snapshot of the DataFrame at runtime?

> How often do you need the data to be refreshed? After each period, do you care for the entire table or just for the added rows?

At every trigger, I expect to read the full table (including any newly added rows).

isha97 commented 4 months ago

@nishant-neo To reload the snapshot of the DataFrame, you can execute a new read, which will see the latest snapshot of the table.
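For example, one way to do this in a streaming job (a sketch, not the connector's official recipe) is to move the enrichment into foreachBatch, so a fresh read of the table is executed on every trigger; the table id, join key, and sink are illustrative:

```python
def enrich_batch(batch_df, batch_id):
    # A new read per micro-batch creates a new ReadSession and therefore
    # sees the latest snapshot of the table, including newly added rows.
    bqt = (spark.read.format("bigquery")
           .option("table", "my_project.my_dataset.bqt")   # hypothetical table id
           .load())
    enriched = batch_df.join(bqt, batch_df.event_key == bqt.key, "left")
    enriched.show(truncate=False)   # placeholder; write to the real sink here

query = (events.writeStream         # `events` is the Kafka stream from the sketch above
         .foreachBatch(enrich_batch)
         .start())
```

Note that this issues a new BigQuery read on every trigger, so the trigger interval directly drives BigQuery Storage API usage and per-batch latency.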