Open HarshadRanganathan opened 2 years ago
The producer batch.size config is not the number of records in a batch but its size in bytes.
The definition of batch size differs between connectors, e.g. for the Azure Functions sink connector batch.size equals the number of records.
batch.size in the MongoDB connector defines the cursor batch size (the number of events to return from MongoDB)
potential data loss for all consumer groups when increasing partitions with auto.offset.reset=latest: records produced to the new partitions before consumers are assigned to them get skipped
Kafka brokers have a partition count limit, even if those partitions have no active traffic. We had assumed inactive partitions were free, but that's not the case. Each partition has a CPU cost on the broker.
We spent a long time trying to run a cluster with a high partition count but low throughput with as few brokers as possible. It was super unstable until we finally scaled the cluster horizontally.
use constants (for config, but also app name, group id, …), do not allow auto topic creation in prod, use TopologyTestDriver, don't increase the number of partitions
Default batch.size and linger.ms values for producers are probably too low. Increasing them could save tens of thousands of dollars in Kafka infra cost.
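A sketch of what that tuning looks like, using librdkafka/confluent-kafka config key names; the values are illustrative, not recommendations, and the broker address is a placeholder:

```python
# Producer settings trading a little latency for much better batching.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder address
    "batch.size": 131072,       # default is 16384 bytes; larger batches amortize per-request overhead
    "linger.ms": 20,            # default is near zero; waiting a few ms lets batches fill up
    "compression.type": "lz4",  # compression ratios improve on fuller batches
}
print(producer_config["batch.size"] // 1024)  # → 128 KiB per batch
```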
Data skew caused by unbalanced message keys is really hard to fix. Think twice about the keys you chose (it's worth profiling them and gathering statistics).
Partition rebalancing will bite you hard. Plan for it before it's too late.
Kafka Consumer/Producer Failures:
[1] Deserialization errors [2] Rebalance issues [3] NPE [4] Dead Letter topic [5] Poison pills
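A minimal sketch of handling [1] and [5] with [4]: catch deserialization failures and divert poison pills to a dead-letter store instead of crashing the consumer loop. The handler and the in-memory dead-letter list are hypothetical stand-ins for a real dead-letter topic:

```python
import json

DEAD_LETTER = []  # stands in for producing to a real dead-letter topic

def handle_record(raw: bytes):
    """Deserialize a record; route poison pills to the dead letter store."""
    try:
        return json.loads(raw)
    except (ValueError, UnicodeDecodeError) as err:
        DEAD_LETTER.append({"payload": raw, "error": str(err)})
        return None  # skip the record so the consumer keeps making progress

handle_record(b'{"id": 1}')      # deserializes fine
handle_record(b'\xff not json')  # poison pill -> dead letter
print(len(DEAD_LETTER))  # → 1
```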
Kafka issues:
[1] If events for the same key are published within the same few milliseconds, the order is not predictable
[2] race condition issue - what if two processes are reading from the same topic and updating the same row in a relational table - which one gets done first?
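One common answer to [2] is optimistic concurrency: each row carries a version, and an update only applies if the version it read is still current. The names below are hypothetical, not from any Kafka API:

```python
# In-memory stand-in for a relational table with a version column.
table = {"row-1": {"value": "a", "version": 3}}

def update_if_unchanged(key, new_value, expected_version):
    """Compare-and-set: apply the update only if nobody raced us."""
    row = table[key]
    if row["version"] != expected_version:
        return False  # another consumer got there first; retry or skip
    table[key] = {"value": new_value, "version": expected_version + 1}
    return True

assert update_if_unchanged("row-1", "b", expected_version=3) is True
assert update_if_unchanged("row-1", "c", expected_version=3) is False  # stale read loses
```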
Kafka Streams Behavior:
| KStream | KTable | GlobalKTable |
|---|---|---|
| Insert/append-only | Update | Populated with data from all partitions of the topic |
| Enabling log compaction will affect the semantics of the data | Enable log compaction to save space | |
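A toy model of the append vs update distinction: a KStream keeps every event, a KTable upserts by key, which is also why log compaction (per-key latest value) suits a KTable's changelog but changes a stream's semantics:

```python
events = [("user-1", "click"), ("user-2", "view"), ("user-1", "purchase")]

kstream = []   # insert/append-only: every event is kept
ktable = {}    # update: the latest value per key wins
for key, value in events:
    kstream.append((key, value))
    ktable[key] = value

print(len(kstream))      # → 3, all events retained
print(ktable["user-1"])  # → "purchase", the earlier "click" was overwritten
```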
Whenever a Kafka Streams application writes records to Kafka, then it will also assign timestamps to these new records
If two producers write to the same topic partition, there is no guarantee on the event append order.
At least once by default
When publishing a record with exactly-once semantics enabled, a write is not considered successful until it is acknowledged, and a commit is made to “finalize” the write
With exactly-once, multiple records are grouped into a single transaction, and so either all or none of the records are committed.
In the “read_committed” isolation level, the consumer will only return records from transactions that were committed, and any records that were not part of a transaction.
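A toy model of that read_committed filtering, with the log and transaction markers reduced to plain dicts (not the real broker data structures):

```python
# Each record is either non-transactional (txn=None) or part of a transaction.
log = [
    {"value": "plain",  "txn": None},  # not part of a transaction: always visible
    {"value": "tx-ok",  "txn": "t1"},
    {"value": "tx-bad", "txn": "t2"},
]
committed_txns = {"t1"}  # t2 was aborted

def read_committed(records):
    """Return only non-transactional records and records from committed txns."""
    return [r["value"] for r in records
            if r["txn"] is None or r["txn"] in committed_txns]

print(read_committed(log))  # → ['plain', 'tx-ok']
```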
Kafka Consumer/Producer Behavior:
Records are batched at each partition level
Records larger than batch size won't be batched
Batch size
Compression
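A simplified model of the batching rules above: batches accumulate per partition up to batch.size, and an oversized record goes out as a batch of its own. The real producer also closes batches on the linger.ms timer, which this sketch omits:

```python
def batch_records(sizes, batch_size):
    """Group record sizes (bytes) into batches for one partition."""
    batches, current, current_bytes = [], [], 0
    for s in sizes:
        if s >= batch_size:             # oversized record: never batched with others
            if current:
                batches.append(current)
                current, current_bytes = [], 0
            batches.append([s])
            continue
        if current and current_bytes + s > batch_size:
            batches.append(current)     # batch is full, start a new one
            current, current_bytes = [], 0
        current.append(s)
        current_bytes += s
    if current:
        batches.append(current)
    return batches

print(batch_records([100, 100, 20000, 100], batch_size=16384))
# → [[100, 100], [20000], [100]]
```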
In summary, when acks=all with a replication.factor=N and min.insync.replicas=M we can tolerate N-M brokers going down for topic availability purposes
acks=all and min.insync.replicas=2 is the most popular option for data durability and availability and allows you to withstand at most the loss of one Kafka broker
However, if two out of three replicas are not available, the brokers will no longer accept produce requests. Instead, producers that attempt to send data will receive a NotEnoughReplicasException.
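The N-M rule above in one line, as a sanity-check helper:

```python
def tolerable_broker_failures(replication_factor, min_insync_replicas):
    """With acks=all, writes keep succeeding while at least
    min.insync.replicas replicas are in sync, so N - M brokers may fail."""
    return replication_factor - min_insync_replicas

# The popular RF=3, min.insync.replicas=2 setup tolerates one broker loss.
assert tolerable_broker_failures(3, 2) == 1
```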
Replication
Auto Commit
Retries
The producer send operation is now idempotent. In the event of an error that causes a producer retry, the same message—which is still sent by the producer multiple times—will only be written to the Kafka log on the broker once.
Each batch of messages sent to Kafka will contain a sequence number that the broker will use to dedupe any duplicate send.
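A toy model of that dedupe: the broker remembers the highest sequence number appended per producer id and drops replays. (The real broker tracks this per partition and keeps a window of recent batches; this sketch collapses that to one counter.)

```python
class Broker:
    """In-memory stand-in for a broker partition log with idempotence checks."""
    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence appended

    def append(self, producer_id, seq, message):
        if self.last_seq.get(producer_id, -1) >= seq:
            return False  # duplicate send (a producer retry): not appended again
        self.log.append(message)
        self.last_seq[producer_id] = seq
        return True

b = Broker()
b.append("p1", 0, "m0")
b.append("p1", 1, "m1")
b.append("p1", 1, "m1")  # retry of seq 1 is deduped
print(len(b.log))  # → 2
```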
Log Compaction
The decision on whether to consume from the beginning of a topic partition or to only consume new messages when there is no initial offset for the consumer group is controlled by the auto.offset.reset configuration
auto.offset.reset=earliest
auto.offset.reset=latest
auto.offset.reset=none
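The three policies, modeled as a starting-offset decision (a sketch of the semantics, not the consumer's actual code path):

```python
def starting_offset(committed, log_end, policy):
    """Pick where a consumer starts when its group has no committed offset."""
    if committed is not None:
        return committed   # the reset policy only applies without a committed offset
    if policy == "earliest":
        return 0           # replay the partition from the beginning
    if policy == "latest":
        return log_end     # only messages produced from now on
    raise RuntimeError("no initial offset")  # policy == "none" surfaces an error

assert starting_offset(None, 500, "earliest") == 0
assert starting_offset(None, 500, "latest") == 500
assert starting_offset(42, 500, "latest") == 42  # committed offset always wins
```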
To specify retention by time, we have to set retention.ms on the topic (or log.retention.ms/hours at the broker level)
Expiring messages by size is based on the total number of bytes of messages retained (retention.bytes)
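A toy model of size-based retention: whole segments are deleted oldest-first until the partition is back under the byte limit, and the active segment is never deleted (time-based retention works the same way, keyed on segment age):

```python
def enforce_retention(segment_sizes, retention_bytes):
    """Drop oldest closed segments until total bytes fit retention_bytes.
    segment_sizes is ordered oldest first; the last entry is the active segment."""
    segments = list(segment_sizes)
    while len(segments) > 1 and sum(segments) > retention_bytes:
        segments.pop(0)  # delete the oldest closed segment
    return segments

print(enforce_retention([1000, 1000, 1000], retention_bytes=2500))  # → [1000, 1000]
```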
Kafka:
https://strimzi.io/blog/2021/12/17/kafka-segment-retention/
https://www.confluent.io/blog/5-common-pitfalls-when-using-apache-kafka/
Streams:
https://medium.com/lydtech-consulting/kafka-streams-introduction-d7e5421feb1b
https://blog.rockthejvm.com/kafka-streams/
Optimization:
https://developers.redhat.com/articles/2022/05/03/fine-tune-kafka-performance-kafka-optimization-theorem#optimization_goals_for_kafka
https://strimzi.io/blog/2021/06/08/broker-tuning/
https://strimzi.io/blog/2021/01/07/consumer-tuning/
https://strimzi.io/blog/2020/10/15/producer-tuning/
https://medium.com/paypal-tech/kafka-consumer-benchmarking-c726fbe4000
https://medium.com/bigpanda-engineering/sleeping-good-at-night-kafka-configurations-tweaks-6dd4d3aaf4e5
https://www.conduktor.io/kafka/kafka-advanced-concepts
Operations:
https://strimzi.io/blog/2020/06/15/cruise-control/
Logstash:
https://discuss.elastic.co/t/multiple-logstash-reading-from-a-single-kafka-topic/27727