Open abdelhakimbendjabeur opened 9 months ago
I think this may be related to https://github.com/confluentinc/kafka-connect-bigquery/pull/333
But 2.6 has been tagged as an RC since September. Do you folks have a target date for its release?
@sp-gupta @b-goyal Sorry to ping you, do you have any insight regarding this issue?
Hello @abdelhakimbendjabeur. We are facing a similar issue with our BigQuery Sink connector deployment, so I'm interested in this topic. Have you found any way of mitigating this?
Hi @andrelu No progress on our side on this one. We had to rerun the connector to cover the missing data. This is not ideal as it's more expensive.
Hello 👋
I have been experiencing some data consistency issues when sinking a single Kafka topic to a BigQuery table using the connector.
Version
wepay/kafka-connect-bigquery 2.5.0
cp-kafka-connect-base:7.4.0
Source Topic
The source topic contains CDC data from a PG table. It goes through a Flink pipeline that adds a few extra columns and filters rows based on some criteria:
- timestamp -> refers to the moment when the PG transaction occurred (insert, update, delete)
- event_id -> a UUID that is unique per payload; it is a hash used to identify unique events.
Connector configuration
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=io.apicurio.registry.utils.converter.AvroConverter
value.converter.schemas.enable=true
value.converter.apicurio.registry.url=http://apicurio-registry.apicurio.svc.cluster.local:8080/apis/registry/v2
value.converter.apicurio.auth.username=uu1
value.converter.apicurio.auth.password=xxx
value.converter.apicurio.registry.as-confluent=false
offset.storage.topic=kc.internal.analytics-cdc-filtered.offsets
offset.storage.replication.factor=3
config.storage.topic=kc.internal.analytics-cdc-filtered.configs
config.storage.replication.factor=3
status.storage.topic=kc.internal.analytics-cdc-filtered.status
status.storage.replication.factor=3
offset.flush.interval.ms=10000
plugin.path=/usr/share/java,/usr/share/confluent-hub-components,/usr/share/java/kafka-connect-plugins/
I use 2 custom SMTs, neither of which drops records:
- CopyFieldFromKeyToValue -> adds a new field to the payload, copied from the record key.
- SetToEpoch -> if the value is negative, replaces it with 0.
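For clarity, here is a minimal sketch of what the two transforms do to each record. This is illustrative Python, not the actual Java SMT code; the field names (`ticket_id`, `timestamp`) and dict-shaped key/value are assumptions for the example:

```python
# Illustrative sketch of the two custom SMTs; the real implementations
# are Java Kafka Connect Transformations. Field names are made up.

def copy_field_from_key_to_value(key: dict, value: dict, field: str) -> dict:
    """CopyFieldFromKeyToValue: copy one field from the record key into the value."""
    enriched = dict(value)
    enriched[field] = key[field]
    return enriched

def set_to_epoch(value: dict, field: str) -> dict:
    """SetToEpoch: if the field is negative, replace it with 0 (the epoch)."""
    fixed = dict(value)
    if fixed[field] < 0:
        fixed[field] = 0
    return fixed

record_key = {"ticket_id": 42}
record_value = {"event_id": "uuid-1", "timestamp": -5}
out = set_to_epoch(
    copy_field_from_key_to_value(record_key, record_value, "ticket_id"),
    "timestamp",
)
# out == {"event_id": "uuid-1", "timestamp": 0, "ticket_id": 42}
```

Note that neither transform ever returns null, so no record is filtered out at the SMT stage.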
Deployment on Kubernetes
PS. I deliberately allocated generous resources because I had already seen the data consistency issues and thought they might be related to repeated restarts caused by CPU/memory throttling when resources were insufficient.
Bug description
After deploying the connector, I waited for it to reach the tail of the topic before running some checks, and I noticed that some records were missing. How did I notice? I have another pipeline that sinks the same records to ClickHouse for analytics purposes, and the records are present there.
How did I proceed? I reset the connector's consumer group offsets to 0 on every partition and let it re-consume the topic:
GROUP                                    TOPIC                         PARTITION  NEW-OFFSET
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  9          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  11         0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  3          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  2          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  5          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  8          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  6          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  1          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  7          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  10         0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  0          0
connect-analytics-cdc-filtered-bq-sink   analytics.cdc.ticket-message  4          0
Once the consumer group reached the tail again, I ran the same query on both the table with duplicates and the backup.
What I discovered is that the second run had brought in new records, meaning the first run skipped them somehow, which is very concerning. The rows circled in red (in the attached screenshots) show more unique tickets/messages/events in the run_after_offset_reset data. Since that run has more unique tickets/messages/events, those records are indeed in the topic and were somehow missed during the first sink.
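The check itself reduces to a set difference on event_id between the two runs. A minimal sketch with made-up IDs (in practice the two sets would come from queries against the first-run table and the post-reset table):

```python
# Hypothetical event_id sets from the first run and from the run after
# the offset reset (in reality these come from BigQuery queries).
first_run_ids = {"e1", "e2", "e4"}
after_reset_ids = {"e1", "e2", "e3", "e4", "e5"}

# Records re-consumed after the reset but absent from the first sink:
# these are the ones that were silently skipped.
skipped = after_reset_ids - first_run_ids
print(sorted(skipped))  # ['e3', 'e5']
```

Any non-empty difference here means records that exist in the topic never made it to BigQuery on the first pass.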
I am having trouble understanding where the problem comes from; no unusual error logs have been observed.
If anybody has experienced something similar, or if there is something wrong with the config, I'd love to hear about it.
Thank you 🙏