nmusku opened this issue 4 years ago
Streaming is on our roadmap. Can you please elaborate on your use case? Please feel free to contact us directly.
Hi, we have data flowing directly into BigQuery (via fluentd) in real time. My use case is to query/filter and transform that raw data into meaningful events using this spark-connector. The data is ingested based on a timestamp, so if there are delays in ingestion I would also like to go back in time (say, a threshold of 15 minutes) and read the delayed data. I'm not sure how to achieve this via batch jobs. For example:
spark.read.format("bigquery").option("filter", "start_time > current-5 minutes").option("filter", "end_time > current")
Might not work ^^
Note: the reads will be from a view.
@davidrabinowitz any thoughts? Is it possible to use any timestamp or any offset?
@nmusku Yes, for the time being you can implement it with a query like you've suggested. BTW, you can also merge the two filters into one:
spark.read.format("bigquery").option("filter", "start_time > current-5 minutes AND end_time > current")
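For example, a minimal sketch of that merged filter, assuming start_time is a TIMESTAMP column, a 15-minute lookback window, and placeholder project/dataset/view names (the view-related options are there only because the reads are from a view, per the note above):

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Assumes an existing SparkSession `spark` (e.g. in spark-shell).
// start_time, the 15-minute window, and the table/dataset names are placeholders.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC)
val windowEnd   = Instant.now()
val windowStart = windowEnd.minus(15, ChronoUnit.MINUTES)

val events = spark.read.format("bigquery")
  .option("viewsEnabled", "true")                 // the reads are from a view
  .option("materializationDataset", "my_dataset") // placeholder dataset used to materialize the view
  .option("filter",
    s"start_time >= TIMESTAMP '${fmt.format(windowStart)} UTC' AND " +
    s"start_time <  TIMESTAMP '${fmt.format(windowEnd)} UTC'")
  .load("my_project.my_dataset.my_view")
```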
OK, one more question: are the events in BigQuery ordered?
Is there any news on this? Now with GA4 it would be cool to have streaming integration in Spark.
Hi @davidrabinowitz ,
I am also interested in a readStream feature.
We have an ETL pipeline that extracts campaign data from BigQuery and loads it into our Delta Lake.
The struggle we face is doing incremental ETL without loading duplicate data into our Delta Lake. With readStream and checkpointing, this would hopefully be solved.
Could you maybe share more information on the timeline for the readStream feature?
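For reference, the kind of hand-rolled batch workaround we are trying to avoid looks roughly like the sketch below (event_id, event_time, and the paths are placeholder names, and it assumes the Delta table already contains rows to derive the watermark from):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.max

// Assumes an existing SparkSession `spark` with Delta Lake enabled,
// and that the target table already has at least one row.
val deltaPath = "/mnt/lake/campaigns"   // placeholder Delta table path

// 1. Derive the watermark from what is already in the lake.
val lastLoaded = spark.read.format("delta").load(deltaPath)
  .agg(max("event_time")).collect()(0).getTimestamp(0)

// 2. Read only the rows newer than the watermark from BigQuery.
val increment = spark.read.format("bigquery")
  .option("filter", s"event_time > TIMESTAMP '$lastLoaded'")
  .load("my_project.my_dataset.campaigns")

// 3. MERGE on a business key so re-reading overlapping data stays idempotent.
DeltaTable.forPath(spark, deltaPath).as("t")
  .merge(increment.as("s"), "t.event_id = s.event_id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```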
We are also interested in this use case.
We land data in BigQuery in real time from sources such as Fivetran, Fluentd, etc.
We would like to build Spark streaming applications starting from spark.readStream.format("bigquery")
and trigger new micro-batches when new data arrives.
@davidrabinowitz any update on this topic? We're also interested in this.
Can you please elaborate on the use case, especially how you want to read?
@davidrabinowitz our use case is streaming reads of incremental data from BigQuery tables, something like spark.readStream.format("bigquery").option("inc_col", "create_time"), where we can configure the incremental column so that each read only picks up the newly added data. Is this supported now? Any suggestions?
If not, how can I do it with the current connector? Any thoughts?
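Right now the best workaround I can think of with the current connector is a plain polling loop that keeps its own watermark, roughly like the sketch below (the watermark file, sink path, and table name are placeholders; only create_time comes from the example above):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import org.apache.spark.sql.functions.max

// Assumes an existing SparkSession `spark`. The watermark is persisted to a
// local file so a restarted job resumes where it left off (placeholder path).
val watermarkFile = Paths.get("/tmp/bq_watermark.txt")

def readWatermark(): String =
  if (Files.exists(watermarkFile))
    new String(Files.readAllBytes(watermarkFile), StandardCharsets.UTF_8).trim
  else "1970-01-01 00:00:00"

def writeWatermark(ts: String): Unit = {
  Files.write(watermarkFile, ts.getBytes(StandardCharsets.UTF_8),
    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  ()
}

while (true) {
  val watermark = readWatermark()
  // Only read rows added since the last iteration.
  val batch = spark.read.format("bigquery")
    .option("filter", s"create_time > TIMESTAMP '$watermark'")
    .load("my_project.my_dataset.my_table")
    .cache()

  if (!batch.isEmpty) {
    batch.write.format("parquet").mode("append").save("/tmp/out")  // placeholder sink
    val newMax = batch.agg(max("create_time")).collect()(0).getTimestamp(0).toString
    writeWatermark(newMax)
  }
  batch.unpersist()
  Thread.sleep(60 * 1000)  // poll interval, arbitrary
}
```

Note that this assumes create_time values are distinct enough that the strict ">" does not drop late rows sharing the watermark timestamp, which is not guaranteed with streaming inserts. That is exactly why a native readStream source with its own checkpointing would help.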