Open HarshadRanganathan opened 2 years ago
The producer batch.size config is not the number of records in a batch but its size in bytes.
The definition of batch size differs between connectors, e.g. for the Azure Functions sink connector batch.size equals the number of records.
batch.size in the MongoDB connector defines the cursor batch size (the number of events to return from MongoDB)
potential data loss for all consumer groups when increasing partitions with auto.offset.reset=latest: records produced to the new partitions before consumers are assigned to them get skipped
Kafka brokers have a partition count limit, even if those partitions have no active traffic. We had assumed inactive partitions were free, but that's not the case. Each partition has a CPU cost on the broker.
We spent a long time trying to run a cluster with a high partition count but low throughput with as few brokers as possible. It was super unstable until we finally scaled the cluster horizontally.
use constants (for config, but also app name, group id, …), do not allow auto topic creation in prod, use TopologyTestDriver, don't increase the number of partitions
Default batch.size and linger.ms values for producers are probably too low. Increasing them could save tens of thousands of dollars in Kafka infra cost.
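A sketch of what that tuning looks like, using librdkafka/confluent-kafka config key names; the values are illustrative, not recommendations, and the broker address is a placeholder:

```python
# Producer settings trading a little latency for much better batching.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "batch.size": 131072,       # default is 16384 bytes; larger batches amortize per-request overhead
    "linger.ms": 20,            # default is near zero; waiting a few ms lets batches fill up
    "compression.type": "lz4",  # compression ratios improve on fuller batches
}
print(producer_config["batch.size"] // 1024)  # → 128 KiB per batch
```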
Data skew caused by unbalanced message keys is really hard to fix. Think twice about the keys you chose (it's worth profiling them and gathering statistics).
Partition rebalancing will bite you hard. Plan for it before it's too late.
Kafka Consumer/Producer Failures:
[1] Deserialization errors [2] Rebalance issues [3] NPE [4] Dead Letter topic [5] Poison pills
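A minimal sketch of handling [1] and [5] with [4]: catch deserialization failures and divert poison pills to a dead-letter store instead of crashing the consumer loop. The handler and the in-memory dead-letter list are hypothetical stand-ins for a real dead-letter topic:

```python
import json

DEAD_LETTER = []  # stands in for producing to a real dead-letter topic

def handle_record(raw: bytes):
    """Deserialize a record; route poison pills to the dead letter store."""
    try:
        return json.loads(raw)
    except (ValueError, UnicodeDecodeError) as err:
        DEAD_LETTER.append({"payload": raw, "error": str(err)})
        return None  # skip the record so the consumer keeps making progress

handle_record(b'{"id": 1}')      # deserializes fine
handle_record(b'\xff not json')  # poison pill -> dead letter
print(len(DEAD_LETTER))  # → 1
```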
Kafka issues:
[1] If events for the same key are published within the same few milliseconds, the order is not predictable
[2] race condition issue - what if two processes are reading from the same topic and updating the same row in a relational table - which one gets done first?
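One common answer to [2] is optimistic concurrency: each row carries a version, and an update only applies if the version it read is still current. The names below are hypothetical, not from any Kafka API:

```python
# In-memory stand-in for a relational table with a version column.
table = {"row-1": {"value": "a", "version": 3}}

def update_if_unchanged(key, new_value, expected_version):
    """Compare-and-set: apply the update only if nobody raced us."""
    row = table[key]
    if row["version"] != expected_version:
        return False  # another consumer got there first; retry or skip
    table[key] = {"value": new_value, "version": expected_version + 1}
    return True

assert update_if_unchanged("row-1", "b", expected_version=3) is True
assert update_if_unchanged("row-1", "c", expected_version=3) is False  # stale read loses
```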
Kafka Streams Behavior:
| KStream | KTable | GlobalKTable |
|---|---|---|
| Insert/append-only | Update | Populated with data from all partitions of the topic |
| Enabling log compaction will affect the semantics of the data | Enable log compaction to save space | |
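A toy model of the append vs update distinction: a KStream keeps every event, a KTable upserts by key, which is also why log compaction (per-key latest value) suits a KTable's changelog but changes a stream's semantics:

```python
events = [("user-1", "click"), ("user-2", "view"), ("user-1", "purchase")]

kstream = []   # insert/append-only: every event is kept
ktable = {}    # update: the latest value per key wins
for key, value in events:
    kstream.append((key, value))
    ktable[key] = value

print(len(kstream))      # → 3, all events retained
print(ktable["user-1"])  # → "purchase", the earlier "click" was overwritten
```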
Whenever a Kafka Streams application writes records to Kafka, then it will also assign timestamps to these new records
If two producers write to the same topic partition, there is no guarantee on the event append order.
At least once by default
When publishing a record with exactly-once semantics enabled, a write is not considered successful until it is acknowledged, and a commit is made to “finalize” the write
With exactly-once, multiple records are grouped into a single transaction, and so either all or none of the records are committed.
In the “read_committed” isolation level, the consumer will only return records from transactions that were committed, and any records that were not part of a transaction.
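A toy model of that read_committed filtering, with the log and transaction markers reduced to plain dicts (not the real broker data structures):

```python
# Each record is either non-transactional (txn=None) or part of a transaction.
log = [
    {"value": "plain",  "txn": None},  # not part of a transaction: always visible
    {"value": "tx-ok",  "txn": "t1"},
    {"value": "tx-bad", "txn": "t2"},
]
committed_txns = {"t1"}  # t2 was aborted

def read_committed(records):
    """Return only non-transactional records and records from committed txns."""
    return [r["value"] for r in records
            if r["txn"] is None or r["txn"] in committed_txns]

print(read_committed(log))  # → ['plain', 'tx-ok']
```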
Kafka Consumer/Producer Behavior:
Records are batched at each partition level
Records larger than batch size won't be batched
Batch size
Compression
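A simplified model of the batching rules above: batches accumulate per partition up to batch.size, and an oversized record goes out as a batch of its own. The real producer also closes batches on the linger.ms timer, which this sketch omits:

```python
def batch_records(sizes, batch_size):
    """Group record sizes (bytes) into batches for one partition."""
    batches, current, current_bytes = [], [], 0
    for s in sizes:
        if s >= batch_size:             # oversized record: never batched with others
            if current:
                batches.append(current)
                current, current_bytes = [], 0
            batches.append([s])
            continue
        if current and current_bytes + s > batch_size:
            batches.append(current)     # batch is full, start a new one
            current, current_bytes = [], 0
        current.append(s)
        current_bytes += s
    if current:
        batches.append(current)
    return batches

print(batch_records([100, 100, 20000, 100], batch_size=16384))
# → [[100, 100], [20000], [100]]
```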
In summary, when acks=all with a replication.factor=N and min.insync.replicas=M we can tolerate N-M brokers going down for topic availability purposes
acks=all and min.insync.replicas=2 is the most popular option for data durability and availability and allows you to withstand at most the loss of one Kafka broker
However, if two out of three replicas are not available, the brokers will no longer accept produce requests. Instead, producers that attempt to send data will receive a NotEnoughReplicasException.
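The N-M rule above in one line, as a sanity-check helper:

```python
def tolerable_broker_failures(replication_factor, min_insync_replicas):
    """With acks=all, writes keep succeeding while at least
    min.insync.replicas replicas are in sync, so N - M brokers may fail."""
    return replication_factor - min_insync_replicas

# The popular RF=3, min.insync.replicas=2 setup tolerates one broker loss.
assert tolerable_broker_failures(3, 2) == 1
```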
Replication
Auto Commit
Retries
The producer send operation is now idempotent. In the event of an error that causes a producer retry, the same message—which is still sent by the producer multiple times—will only be written to the Kafka log on the broker once.
Each batch of messages sent to Kafka will contain a sequence number that the broker will use to dedupe any duplicate send.
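A toy model of that dedupe: the broker remembers the highest sequence number appended per producer id and drops replays. (The real broker tracks this per partition and keeps a window of recent batches; this sketch collapses that to one counter.)

```python
class Broker:
    """In-memory stand-in for a broker partition log with idempotence checks."""
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence appended

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # duplicate send (a producer retry): not appended again
        self.log.append(message)
        self.last_seq[producer_id] = seq
        return True

b = Broker()
b.append("p1", 0, "m0")
b.append("p1", 1, "m1")
b.append("p1", 1, "m1")  # retry of seq 1 is deduped
print(len(b.log))  # → 2
```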
Log Compaction
The decision on whether to consume from the beginning of a topic partition or to only consume new messages when there is no initial offset for the consumer group is controlled by the auto.offset.reset configuration
auto.offset.reset=earliest
auto.offset.reset=latest
auto.offset.reset=none
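The three policies, modeled as a starting-offset decision (a sketch of the semantics, not the consumer's actual code path):

```python
def starting_offset(committed, log_end, policy):
    """Pick where a consumer starts when its group has no committed offset."""
    if committed is not None:
        return committed   # the reset policy only applies without a committed offset
    if policy == "earliest":
        return 0           # replay the partition from the beginning
    if policy == "latest":
        return log_end     # only messages produced from now on
    raise RuntimeError("no initial offset")  # policy == "none" surfaces an error

assert starting_offset(None, 500, "earliest") == 0
assert starting_offset(None, 500, "latest") == 500
assert starting_offset(42, 500, "latest") == 42  # committed offset always wins
```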
To specify retention by time, we have to set retention.ms on the topic (or log.retention.ms/hours at the broker level)
Expiring messages by size is based on the total number of bytes of messages retained (retention.bytes)
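A toy model of size-based retention: whole segments are deleted oldest-first until the partition is back under the byte limit, and the active segment is never deleted (time-based retention works the same way, keyed on segment age):

```python
def enforce_retention(segment_sizes, retention_bytes):
    """Drop oldest closed segments until total bytes fit retention_bytes.
    segment_sizes is ordered oldest first; the last entry is the active segment."""
    segments = list(segment_sizes)
    while len(segments) > 1 and sum(segments) > retention_bytes:
        segments.pop(0)  # delete the oldest closed segment
    return segments

print(enforce_retention([1000, 1000, 1000], retention_bytes=2500))  # → [1000, 1000]
```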
Kafka:
https://strimzi.io/blog/2021/12/17/kafka-segment-retention/
https://www.confluent.io/blog/5-common-pitfalls-when-using-apache-kafka/
Streams:
https://medium.com/lydtech-consulting/kafka-streams-introduction-d7e5421feb1b
https://blog.rockthejvm.com/kafka-streams/
Optimization:
https://developers.redhat.com/articles/2022/05/03/fine-tune-kafka-performance-kafka-optimization-theorem#optimization_goals_for_kafka
https://strimzi.io/blog/2021/06/08/broker-tuning/
https://strimzi.io/blog/2021/01/07/consumer-tuning/
https://strimzi.io/blog/2020/10/15/producer-tuning/
https://medium.com/paypal-tech/kafka-consumer-benchmarking-c726fbe4000
https://medium.com/bigpanda-engineering/sleeping-good-at-night-kafka-configurations-tweaks-6dd4d3aaf4e5
https://www.conduktor.io/kafka/kafka-advanced-concepts
Operations:
https://strimzi.io/blog/2020/06/15/cruise-control/
Logstash:
https://discuss.elastic.co/t/multiple-logstash-reading-from-a-single-kafka-topic/27727