GoogleCloudDataproc / spark-bigquery-connector

BigQuery data source for Apache Spark: Read data from BigQuery into DataFrames, write DataFrames into BigQuery tables.
Apache License 2.0

Do you have plans to support streaming in the near future? Interested in the readStream use case: spark.readStream.format("bigquery") #259

Open nmusku opened 4 years ago

nmusku commented 4 years ago

If not, how can I do it with the current connector? Any thoughts?

davidrabinowitz commented 4 years ago

Streaming is on our roadmap. Can you please elaborate on your use case? Please feel free to contact us directly.

nmusku commented 4 years ago

Hi, we have data flowing directly into BigQuery (via fluentd) in real time. My use case is to query/filter and transform that raw data into meaningful events using this Spark connector. The ingested data is keyed by timestamp, so if there are delays in ingestion I would like to go back in time (say, a threshold of 15 minutes) and read the delayed data as well. I'm not sure how to achieve this via batch jobs. Example:

spark.read.format("bigquery").option("filter", "start_time > current-5 minutes").option("filter", "end_time > current")

Might not work ^^

Note: the reads will be from a view.

nmusku commented 4 years ago

@davidrabinowitz any thoughts? Is it possible to use any timestamp or any offset?

davidrabinowitz commented 4 years ago

@nmusku Yes, for the time being you can implement it with a query like you've suggested. BTW, you can also merge the filters:

spark.read.format("bigquery").option("filter", "start_time > current-5 minutes AND end_time > current")
nmusku commented 4 years ago

OK, one more question: are the events in BigQuery ordered?

rwagenmaker commented 3 years ago

Is there any news on this? Now with GA4 it would be cool to get streaming integration in Spark.

Magicbeanbuyer commented 2 years ago

Hi @davidrabinowitz ,

I am also interested in a readStream feature.

We have an ETL pipeline that extracts campaign data from BigQuery and loads it into our Delta Lake.

The struggle we face is doing incremental ETL without loading duplicate data into our Delta Lake. With readStream and checkpointing, hopefully this will be solved.

Could you maybe share more information on the timeline for the readStream feature?
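
Until a streaming source exists, one workaround for this kind of incremental, dedup-safe load is a watermark-filtered batch read followed by a Delta MERGE. The sketch below is an assumption-laden illustration, not anything from the connector docs or this thread: a Delta table at `deltaPath` keyed by `event_id`, an ingestion timestamp column `created_at`, and a non-empty sink are all hypothetical.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().appName("bq-incremental-to-delta").getOrCreate()
val deltaPath = "s3://my-bucket/delta/campaigns"

// 1. Read the current high-watermark from the Delta sink (assumes it already has data).
val watermark = spark.read.format("delta").load(deltaPath)
  .agg(max("created_at")).collect()(0).getTimestamp(0)

// 2. Pull only rows newer than the watermark; the "filter" option is pushed down to BigQuery.
val fresh = spark.read.format("bigquery")
  .option("filter", s"created_at > '$watermark'")
  .load("my_project.my_dataset.campaigns")

// 3. MERGE into Delta so late or re-read rows do not create duplicates.
DeltaTable.forPath(spark, deltaPath).as("t")
  .merge(fresh.as("s"), "t.event_id = s.event_id")
  .whenNotMatched().insertAll()
  .execute()
```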

benney-au-le commented 2 years ago

We are also interested in this use case. We land data in BigQuery in real time from sources such as Fivetran, Fluentd, etc. We would like to build Spark streaming applications starting from spark.readStream.format("bigquery") that trigger new micro-batches when new data arrives.

kaiseu commented 1 year ago

@davidrabinowitz any update on this topic? we're also interested in this.

davidrabinowitz commented 1 year ago

Can you please elaborate on the use case, especially how you want to read?

kaiseu commented 1 year ago

@davidrabinowitz our use case is a streaming read of incremental data from BigQuery tables, something like spark.readStream.format("bigquery").option("inc_col", "create_time"), where we can configure the incremental column so that each run reads only the newly added data. Is this supported now? Any suggestions?
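
For reference, the shape of what is being requested looks roughly like the following. This is purely illustrative: the connector does not expose a streaming source as of this thread, "inc_col" is the commenter's own placeholder option name, and the table, column, and bucket paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("bq-readstream-request").getOrCreate()

// Hypothetical streaming read: not supported by the connector at the time of this thread.
val events = spark.readStream.format("bigquery")
  .option("inc_col", "create_time")             // hypothetical incremental/offset column
  .load("my_project.my_dataset.events")

events.writeStream
  .format("delta")
  .option("checkpointLocation", "gs://my-bucket/checkpoints/events")
  .trigger(Trigger.ProcessingTime("5 minutes")) // poll for newly added rows periodically
  .start("gs://my-bucket/delta/events")
```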