BorisTyshkevich opened this issue 1 year ago
@BorisTyshkevich why not use ReplicatedMergeTree for the offset table?
Replication can lag, and that produces more duplicates. KeeperMap was introduced specifically for data like binlog positions and works quite well.
It's even possible to get exactly-once delivery by doing a two-stage commit with the next and last positions.
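As a rough sketch of how the two positions could be stored (the table name, columns, and Keeper path below are hypothetical, not the connector's actual schema):

```sql
-- Hypothetical offset table backed by KeeperMap; the state lives in
-- (Zoo)Keeper, so it is shared by all replicas and survives node loss.
CREATE TABLE sink_connector_offsets
(
    `id`            String,                  -- e.g. "<topic>:<partition>"
    `last_position` String,                  -- position already committed to ClickHouse
    `next_position` String,                  -- end position of the in-flight insert block
    `updated_at`    DateTime DEFAULT now()
)
ENGINE = KeeperMap('/sink_connector/offsets')
PRIMARY KEY id;
```

The two-stage commit would write `next_position` before sending a block and promote it to `last_position` only after the insert succeeds; on restart, if the two differ, the connector re-reads the binlog from `last_position` to `next_position` and re-sends the very same block.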
Please note that it does not matter if the position lags; re-inserting into a ReplacingMergeTree is harmless (a few seconds of lag is not a big deal).
RMT is not the best instrument for dealing with duplicates. FINAL speed is not the only issue: you can't set up an aggregating MV on a table that has duplicates. That's why exactly-once delivery is better. The right thing is to send ClickHouse the very same block after a connector restart, so that checksum-based block deduplication kicks in. That requires keeping both the start and end position for every insert block.
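For reference, a minimal illustration of the checksum-based block deduplication this relies on (the table is just a throwaway example):

```sql
-- Identical insert blocks into a Replicated*MergeTree table are
-- deduplicated by checksum, so replaying the same block after a
-- connector restart is a no-op rather than a duplicate.
CREATE TABLE dedup_demo
(
    `id`      UInt64,
    `payload` String
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/dedup_demo', '{replica}')
ORDER BY id;

INSERT INTO dedup_demo VALUES (1, 'a'), (2, 'b');
INSERT INTO dedup_demo VALUES (1, 'a'), (2, 'b');  -- same block, dropped

SELECT count() FROM dedup_demo;  -- 2, not 4
```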
I am open to discussion, but using RMT is a design choice. FINAL is not an issue for this use case; we should document the recommended settings.
@BorisTyshkevich please note that sink != sync.
The Sink Connector connects to only one node at a time, even with a ClickHouse cluster. If that replica fails, we can't continue consuming events from the OLTP database, because the position and history stay on the unavailable server. We need to make the connector's internal tables cluster-wide.
For the offset table I suggest KeeperMap, and for the history table ReplicatedMergeTree. Here is the example:
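A sketch of what that could look like; the table names, columns, and Keeper paths are placeholders rather than the connector's real schema:

```sql
-- Offset table: KeeperMap, so the binlog position is stored in Keeper
-- and is visible from every replica.
CREATE TABLE connector_offsets ON CLUSTER '{cluster}'
(
    `id`               String,                  -- e.g. source server id
    `offset_value`     String,                  -- serialized binlog position / GTID set
    `record_insert_ts` DateTime DEFAULT now()
)
ENGINE = KeeperMap('/sink_connector/offsets')
PRIMARY KEY id;

-- Schema history table: ReplicatedMergeTree, so DDL history survives
-- the loss of any single replica.
CREATE TABLE connector_schema_history ON CLUSTER '{cluster}'
(
    `id`               String,
    `history_record`   String,                  -- DDL / history event, e.g. as JSON
    `record_insert_ts` DateTime DEFAULT now()
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/connector_schema_history', '{replica}')
ORDER BY (id, record_insert_ts);
```

With both tables defined ON CLUSTER, any surviving replica can resume consumption from the last stored position.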