Closed darshanmehta10 closed 6 years ago
cc: @reuvenlax @chamikaramj @jkff
Is this issue BigQuery specific?
Generally speaking, you can dedupe based on an arbitrary property with Distinct.withRepresentativeValueFn.
Thanks for the report! The main issue tracker for Beam is JIRA. I've created your issue there https://issues.apache.org/jira/browse/BEAM-2864 and will close this one.
Out of curiosity has there been any updates with this? I have a very similar use case and am wondering if there is a suggested workaround for the time being, and if this is a valid use case that will be supported in the future. Thanks!
This is a feature request and not an issue.
We are using the following code in dataflow to write the messages to BigQuery:
Now, we often come across the scenarios where we need to do a re-play and push the missing/invalid messages to the pipeline. It works fine for missing records, however, for invalid/corrupt records, it ends up duplicating the rows in BigQuery table.
So, we need a way to specify/perform an upsert based on a certain field in dataflow pipeline in order to prevent the duplicates.
Here's stackoverflow question I asked about this.