GoogleCloudPlatform / DataflowJavaSDK

Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
http://cloud.google.com/dataflow

Add functionality to perform a dedupe based on specific field #603

Closed darshanmehta10 closed 6 years ago

darshanmehta10 commented 6 years ago

This is a feature request and not an issue.

We are using the following code in Dataflow to write messages to BigQuery:

BigQueryIO.writeTableRows()
    .to("table")
    // create the table from the schema if it does not already exist
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    // append rows; replayed messages are simply appended again
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
    .withSchema(schema);

Now, we often come across scenarios where we need to do a replay and push missing/invalid messages back through the pipeline. This works fine for missing records; however, for invalid/corrupt records it ends up duplicating rows in the BigQuery table.

So we need a way to specify/perform an upsert based on a certain field in the Dataflow pipeline in order to prevent duplicates.

Here's the Stack Overflow question I asked about this.

aaltay commented 6 years ago

cc: @reuvenlax @chamikaramj @jkff

okorz001 commented 6 years ago

Is this issue BigQuery specific?

Generally speaking, you can dedupe based on an arbitrary property with Distinct.withRepresentativeValueFn.
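For example, here is a minimal sketch of that approach, assuming the input is a PCollection<TableRow> named rows and that each row carries a unique "id" field (both names are hypothetical, not from the original post):

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.transforms.Distinct;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Keep one row per distinct value of the (hypothetical) "id" field,
// then write the deduplicated collection with the BigQueryIO transform above.
PCollection<TableRow> deduped = rows.apply(
    Distinct.withRepresentativeValueFn((TableRow row) -> (String) row.get("id"))
        .withRepresentativeType(TypeDescriptors.strings()));

One caveat: Distinct deduplicates within a window, so in a streaming pipeline a replayed message that lands in a different window than the original would still be written twice.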

jkff commented 6 years ago

Thanks for the report! The main issue tracker for Beam is JIRA. I've created your issue there (https://issues.apache.org/jira/browse/BEAM-2864) and will close this one.

gil-naamani commented 6 years ago

Out of curiosity, have there been any updates on this? I have a very similar use case and am wondering whether there is a suggested workaround for the time being, and whether this is a valid use case that will be supported in the future. Thanks!