apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0
7.89k stars 4.27k forks source link

Spanner ReaderWrite Transactions #20493

Open damccorm opened 2 years ago

damccorm commented 2 years ago

We are currently using SpannerIO Mutations for writing Records to SpannerDB using both streaming and batch pipelines: 

However, our business requirements need more functionality in spannerio to be able to handle readwrite transactions. Some of the problems we are facing are:

Scenario 1: We need to make sure that the data that gets processed in the pipeline and committed into Spanner to be always the latest data. Hence we need the capability to be able to read / query specific field eg Date or SequenceNumber to compare with before committing the updated record into Spanner. Eg. only commit a certain record when the current event’s sequenceNumber is greater than whats already in the Spanner row.

Scenario 2: We need to run end of the day batch pipelines which is reconcile any missing data from streaming pipelines, but however if there is some latest data from streaming during the time of batch data write, we are unable to identify which is the latest row as they are two different pipelines

Scenario 3: In a streaming pipeline if an event was delayed and processed at later time by adding to backlog queue, we would not want to perform a Spanner update query on this event where there is relatively new event to the spanner db already.

If SpannerIO has some capability to create readwrite transactions. It would help us to extend it to be utilised in these kind of solutions. 

Could you please help us extend spannerio for such scenarios

Imported from Jira BEAM-10489. Original Jira may contain additional context. Reported by: sai.maduri.

faisalhasnain commented 1 year ago

any ideas when this gets implemented?