confluentinc / kafka-connect-bigquery

A Kafka Connect BigQuery sink connector
Apache License 2.0

De-duplication using insertId #410

Open IvanVas opened 4 months ago

IvanVas commented 4 months ago

BigQuery supports (limited, best-effort) de-duplication of streaming inserts when an `insertId` is supplied with each row: https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency

Can this be used in the connector?

> **Best effort de-duplication**
>
> When you supply `insertId` for an inserted row, BigQuery uses this ID to support best effort de-duplication for up to one minute. That is, if you stream the same row with the same `insertId` more than once within that time period into the same table, BigQuery might de-duplicate the multiple occurrences of that row, retaining only one of those occurrences.
>
> The system expects that rows provided with identical `insertId`s are also identical. If two rows have identical `insertId`s, it is nondeterministic which row BigQuery preserves.
>
> De-duplication is generally meant for retry scenarios in a distributed system where there's no way to determine the state of a streaming insert under certain error conditions, such as network errors between your system and BigQuery or internal errors within BigQuery. If you retry an insert, use the same `insertId` for the same set of rows so that BigQuery can attempt to de-duplicate your data. For more information, see [troubleshooting streaming inserts](https://cloud.google.com/bigquery/docs/streaming-data-into-bigquery#troubleshooting).
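For illustration only, here is a minimal sketch of how a sink task *could* attach a deterministic `insertId` per record using the BigQuery Java client. This is not the connector's current behavior; the class and method names below (`InsertIdSketch`, `insertWithDedup`) are hypothetical. Deriving the id from the record's topic, partition, and offset is one way to keep it stable across retries of the same record, which is what BigQuery's de-duplication relies on:

```java
import java.util.Map;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllRequest.RowToInsert;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.TableId;
import org.apache.kafka.connect.sink.SinkRecord;

public class InsertIdSketch {

    private final BigQuery bigQuery = BigQueryOptions.getDefaultInstance().getService();

    /**
     * Streams one converted record, attaching an insertId derived from the
     * record's Kafka coordinates so that a retried insert of the same record
     * carries the same id and BigQuery can attempt de-duplication.
     */
    public void insertWithDedup(TableId table, SinkRecord record, Map<String, Object> rowContent) {
        // topic/partition/offset uniquely identify a record in Kafka, so a
        // retry of the same record reproduces the same insertId.
        String insertId = record.topic() + "-" + record.kafkaPartition() + "-" + record.kafkaOffset();

        InsertAllRequest request = InsertAllRequest.newBuilder(table)
                .addRow(RowToInsert.of(insertId, rowContent))
                .build();

        InsertAllResponse response = bigQuery.insertAll(request);
        if (response.hasErrors()) {
            throw new RuntimeException("Streaming insert failed: " + response.getInsertErrors());
        }
    }
}
```

Note that, per the quoted docs, BigQuery only applies this de-duplication on a best-effort basis for roughly one minute, so an approach like this would mainly cover retries of recently failed inserts rather than provide exactly-once delivery in general.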