confluentinc / kafka-connect-bigquery

A Kafka Connect BigQuery sink connector
Apache License 2.0

Question: Intermediate table cleanup process after merge flush #399

Open Blackrobe opened 8 months ago

Blackrobe commented 8 months ago

Hi,

I was reading how the sink connector performs a merge flush and saw that, after a batch of data in the intermediate table has been merged, a DELETE FROM statement is executed, as shown here: https://github.com/confluentinc/kafka-connect-bigquery/blob/f176198503e37e743d04b7bf687d7d08571ea851/kcbq-connector/src/main/java/com/wepay/kafka/connect/bigquery/MergeQueries.java

But I am confused about the purpose of line 492:

.append("AND _PARTITIONTIME IS NOT NULL")

When the "DELETE FROM" skips rows whose _PARTITIONTIME is NULL, doesn't this mean that there will be a batch number that will never be merged?
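
For reference, this is roughly how I understand the statement being assembled. It is a minimal sketch and not the connector's actual code: the class, the helper name, the table name, and the `batchNumber` column name are placeholders I made up, and only the `_PARTITIONTIME IS NOT NULL` clause is copied from the line quoted above.

```java
class BatchClearSketch {
    // Hypothetical helper (not the connector's real method): builds a DELETE
    // that clears every batch up to and including `batch`, but only for rows
    // whose _PARTITIONTIME pseudo-column is already populated.
    static String batchClearQuery(String table, String batchNumberColumn, int batch) {
        return new StringBuilder("DELETE FROM ")
            .append("`").append(table).append("`")
            .append(" WHERE ")
            .append(batchNumberColumn).append(" <= ").append(batch).append(" ")
            .append("AND _PARTITIONTIME IS NOT NULL")
            .append(";")
            .toString();
    }

    public static void main(String[] args) {
        // Placeholder table and column names, chosen only for illustration.
        System.out.println(
            batchClearQuery("my_dataset.my_intermediate_table", "batchNumber", 3190));
    }
}
```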

I was wondering what would happen in the scenario below:
... at 3:00 AM: Merge flush for batchNumber 3190
at 3:01 AM: Delete from intermediate table where batchNumber <= 3190 and _partitiontime is not null
at 4:00 AM: Merge flush for batchNumber 3191 ...

From this alone, the second step won't delete records that satisfy both "batchNumber = 3190" and "_partitiontime is null".

Doesn't this basically mean that those records left in batchNumber 3190 will never be merged?
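
To make the scenario concrete, here is a small in-memory sketch of the DELETE predicate applied to a few made-up rows. Everything in it is hypothetical (the `Row` class, the sample values, the method name); it only models the condition "batchNumber <= 3190 and _partitiontime is not null" and says nothing about what the connector does with the surviving rows afterwards.

```java
import java.util.ArrayList;
import java.util.List;

class DeleteScenarioSketch {
    // Hypothetical stand-in for one row of the intermediate table.
    static final class Row {
        final int batchNumber;
        final Long partitionTime; // null models "_PARTITIONTIME IS NULL"

        Row(int batchNumber, Long partitionTime) {
            this.batchNumber = batchNumber;
            this.partitionTime = partitionTime;
        }
    }

    // Returns the rows that the DELETE predicate does NOT match,
    // i.e. the rows still present in the table after the delete step.
    static List<Row> rowsLeftAfterDelete(List<Row> rows, int batch) {
        List<Row> left = new ArrayList<>();
        for (Row r : rows) {
            boolean deleted = r.batchNumber <= batch && r.partitionTime != null;
            if (!deleted) {
                left.add(r);
            }
        }
        return left;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(3190, 1700000000000L)); // partition time set: deleted
        rows.add(new Row(3190, null));           // partition time null: survives
        rows.add(new Row(3191, null));           // later batch: survives
        for (Row r : rowsLeftAfterDelete(rows, 3190)) {
            System.out.println(r.batchNumber + " " + r.partitionTime);
        }
    }
}
```

With these sample rows, the batch 3190 row whose partition time is null is still in the table after the 3:01 AM delete, which is the case I am asking about.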