
Kafka Research and Practice #86

Open Draymonders opened 4 years ago

Draymonders commented 4 years ago

Terminology

Why can't Kafka clients read from or write to follower replicas?

Draymonders commented 4 years ago

Kafka environment setup

kafka-client-tool

Getting the Kafka version

https://blog.csdn.net/boling_cavalry/article/details/85395080

Draymonders commented 4 years ago

First steps with the Kafka commands

Here zk-kafka_kafka_1 is the unique identifier of the Docker container.

  1. Create a topic
    docker exec zk-kafka_kafka_1 kafka-topics.sh --create --topic draymonder --partitions 3 --zookeeper zookeeper:2181 --replication-factor 1
  2. List all topics
    docker exec zk-kafka_kafka_1 kafka-topics.sh --list --zookeeper zookeeper:2181
    # output: draymonder
  3. Describe the newly created topic
    docker exec zk-kafka_kafka_1 kafka-topics.sh --describe --topic draymonder --zookeeper zookeeper:2181
    Topic: draymonder  PartitionCount: 3  ReplicationFactor: 1  Configs:
      Topic: draymonder  Partition: 0  Leader: 1001  Replicas: 1001  Isr: 1001
      Topic: draymonder  Partition: 1  Leader: 1001  Replicas: 1001  Isr: 1001
      Topic: draymonder  Partition: 2  Leader: 1001  Replicas: 1001  Isr: 1001
  4. Consume messages
    docker exec zk-kafka_kafka_1 kafka-console-consumer.sh --topic draymonder --bootstrap-server 10.40.58.64:9092
  5. Produce messages
    docker exec -it zk-kafka_kafka_1 kafka-console-producer.sh --topic draymonder --broker-list 10.40.58.64:9092
Draymonders commented 4 years ago

Producer

Flow

1. new Producer()
2. producer.send()
3. producer.close() // important: close() must be called, otherwise data left in the buffer is never sent; close() first flushes the buffered data and then shuts down.
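A minimal sketch of this flow with the Java client, reusing the broker address and topic from the console commands above:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.40.58.64:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        try {
            // send() is asynchronous: the record only lands in an in-memory buffer here
            producer.send(new ProducerRecord<>("draymonder", "key-1", "hello kafka"));
        } finally {
            // close() drains the buffer before shutting down; skipping it can lose buffered records
            producer.close();
        }
    }
}
```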

TODO: read the producer source code

Load balancing

Load balancing is defined on the client side, by implementing the Partitioner interface; a minimal example follows below.
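A minimal sketch, assuming a simple key-hash strategy is wanted (the class name is illustrative):

```java
import java.util.Arrays;
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class KeyHashPartitioner implements Partitioner {
    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // keyless records all go to partition 0 in this sketch
        }
        // mask the sign bit so the modulo result is never negative
        return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
    }

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public void close() {}
}
```

The producer picks it up via props.put("partitioner.class", KeyHashPartitioner.class.getName()).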

Message guarantees

Draymonders commented 4 years ago

Consumer

Consumer group

Draymonders commented 4 years ago

Stream

Data flow

TODO: learn this later when it's actually needed

Draymonders commented 4 years ago

Interview questions

Kafka concepts

Why Kafka's throughput is high

Log lookup internals

Zero copy

Consumer groups

Topic deletion

Draymonders commented 4 years ago

Kafka Paper

use case

traditional message system

  1. IBM WebSphere MQ has transactional supports that allow an application to insert messages into multiple queues atomically. (guarantees that message insertion is transactional/atomic?)
  2. The JMS specification allows each individual message to be acknowledged after consumption. (every message has to be acked after consumption)
  3. Those systems are weak in distributed support. (little support for distribution)
  4. Finally, many messaging systems assume near immediate consumption of messages, so the queue of unconsumed messages is always fairly small. Their performance degrades significantly if messages are allowed to accumulate. (they don't tolerate message accumulation)
  5. Additionally, most of them use a “push” model in which the broker forwards data to consumers. At LinkedIn, we find the “pull” model more suitable for our applications since each consumer can retrieve the messages at the maximum rate it can sustain and avoid being flooded by messages pushed faster than it can handle. (most systems push, but Kafka pulls; pulling lets each consumer control its own rate instead of being overwhelmed by pushes faster than it can handle; see the pull-loop sketch after this list)
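As an illustration of the pull model, a minimal poll loop with the Java consumer (broker address and topic reused from earlier; the group id is illustrative). If the consumer is slow, unread data simply stays on the broker:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PullDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.40.58.64:9092");
        props.put("group.id", "pull-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("draymonder"));
            while (true) {
                // poll() blocks up to the timeout; each call fetches a batch of records
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}
```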

the architecture of Kafka and its key design principles

  1. Unlike traditional iterators, the message stream iterator never terminates. If there are currently no more messages to consume, the iterator blocks until new messages are published to the topic. (the message-stream iterator never terminates, so the TCP connection doesn't have to be re-established)
  2. We support both the point-to-point delivery model in which multiple consumers jointly consume a single copy of all messages in a topic, as well as the publish/subscribe model in which multiple consumers each retrieve its own copy of a topic. (supports both publish/subscribe and point-to-point)
  3. To balance load, a topic is divided into multiple partitions and each broker stores one or more of those partitions. (divided into multiple partitions for load balancing)

Efficiency on a Single Partition

Simple storage

  1. Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. (messages are stored simply as a log)
  2. Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1GB). Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file. (the log is split into segments, one file per segment)
  3. For better performance, we flush the segment files to disk only after a configurable number of messages have been published or a certain amount of time has elapsed. A message is only exposed to the consumers after it is flushed. (writes are buffered and only flushed to the log file once the configured message count, or time, is reached)
  4. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures that map the message ids to the actual message locations. (messages have no separate ids, so no index has to be built, which avoids random-access overhead)
  5. If the consumer acknowledges a particular message offset, it implies that the consumer has received all messages prior to that offset in the partition. (a consumer committing an offset means it has received all messages before that offset)
  6. Each pull request contains the offset of the message from which the consumption begins and an acceptable number of bytes to fetch. Each broker keeps in memory a sorted list of offsets, including the offset of the first message in every segment file. The broker locates the segment file where the requested message resides by searching the offset list, and sends the data back to the consumer. (a pull request carries an offset; the broker keeps a sorted list of the first offset of every segment file, searches that list to locate the right segment, and returns the messages; see the lookup sketch after this list)
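A minimal sketch of that lookup (not Kafka's actual code): the in-memory structure is essentially a sorted map from each segment's first offset to its file, and the search is a floor lookup:

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class SegmentIndex {
    // base offset of each segment file -> file name, kept sorted
    private final NavigableMap<Long, String> segments = new TreeMap<>();

    public void addSegment(long baseOffset, String fileName) {
        segments.put(baseOffset, fileName);
    }

    // the segment holding `offset` is the one with the greatest base offset <= offset
    public String segmentFor(long offset) {
        Map.Entry<Long, String> e = segments.floorEntry(offset);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        SegmentIndex index = new SegmentIndex();
        index.addSegment(0L, "00000000000000000000.log");
        index.addSegment(1000L, "00000000000000001000.log");
        System.out.println(index.segmentFor(1500L)); // -> 00000000000000001000.log
    }
}
```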

Efficient transfer

  1. The producer can submit a set of messages in a single send request. Although the end consumer API iterates one message at a time, under the covers, each pull request from a consumer also retrieves multiple messages up to a certain size, typically hundreds of kilobytes. (the producer writes in batches and the consumer reads in batches)
  2. Rely on the underlying file system page cache, avoiding double buffering: messages are only cached in the page cache. (relies on the OS page cache rather than an in-process cache, so there is only one layer of buffering) Kafka doesn’t cache messages in process at all, so it has very little overhead in garbage collecting its memory, making an efficient implementation in a VM-based language feasible. (barely any heap usage, so barely any GC)
  3. Zero copy
    • A typical approach to sending bytes from a local file to a remote socket involves the following steps:
      1. read data from the storage medium into the page cache in the OS
      2. copy the data in the page cache to an application buffer
      3. copy the application buffer to another kernel buffer
      4. send the kernel buffer to the socket
    • (zero copy lets the kernel skip the two middle copies; see the sketch below)
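A minimal sketch of zero copy on the JVM: FileChannel.transferTo() asks the kernel to move bytes from the page cache straight to the socket (sendfile on Linux), skipping the two user-space copies listed above. Host, port, and file name here are illustrative.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = new FileInputStream("segment.log").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
            long position = 0;
            long remaining = file.size();
            // transferTo() may send fewer bytes than asked, so loop until done
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```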

Stateless broker

  1. The information about how much each consumer has consumed is not maintained by the broker, but by the consumer itself. (the consumer, not the broker, keeps the offset information)
    • (since the broker doesn't know which messages have been consumed, deleting them is tricky)
    • A message is automatically deleted if it has been retained in the broker longer than a certain period, typically 7 days.
  2. A consumer can deliberately rewind back to an old offset and reconsume data. We note that rewinding a consumer is much easier to support in the pull model than the push model. (the offset can be rewound because this is a pull model; it would be awkward to implement with push; see the sketch below)
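A minimal sketch of such a rewind with the Java consumer: assign a partition explicitly, seek() back to an old offset, and re-consume (broker address and topic as in the earlier examples):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class RewindDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.40.58.64:9092");
        props.put("group.id", "rewind-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("draymonder", 0);
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 0L); // rewind to the beginning of partition 0
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}
```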

Distributed Coordination

  1. each message is delivered to only one of the consumers within the group (within a consumer group, a message is consumed by exactly one consumer)
  2. Make a partition within a topic the smallest unit of parallelism. (a partition is the smallest unit of parallelism)
  3. not have a central “master” node, but instead let consumers coordinate among themselves in a decentralized fashion
    • use Zookeeper (Zookeeper in brief:)
      1. file-system-like API: CRUD on a path, and listing the children of a path
      2. one can register a watcher on a path and get notified when the children of a path or the value of a path has changed
      3. a path can be created as ephemeral (as opposed to persistent), which means that if the creating client is gone, the path is automatically removed by the Zookeeper server
      4. Zookeeper replicates its data to multiple servers, which makes the data highly reliable and available
  4. Kafka uses Zookeeper for the following tasks (see the sketch after this list):
    • detect the addition and the removal of brokers and consumers (detect brokers/consumers joining and leaving)
    • trigger a rebalance process when the above events happen (a rebalance is triggered when those events occur)
    • maintain the consumption relationship and keep track of the consumed offset of each partition
    • Each consumer registers a Zookeeper watcher on both the broker registry and the consumer registry, and will be notified whenever a change in the broker set or the consumer group occurs. (any change in the broker set or the consumer group triggers a notification)
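A minimal sketch of the two ZooKeeper primitives this coordination relies on, using the plain ZooKeeper Java client; the paths are illustrative and their parent nodes are assumed to already exist:

```java
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkCoordinationDemo {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 10_000, event -> {});

        // Ephemeral node: disappears automatically when this client's session
        // dies, which is how broker/consumer removal is detected.
        zk.create("/consumers/group-1/ids/consumer-1", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Watcher: fires once on the next membership change under this path,
        // the kind of event that would trigger a rebalance.
        List<String> members = zk.getChildren("/consumers/group-1/ids",
                event -> System.out.println("membership changed: " + event));
        System.out.println("current members: " + members);
    }
}
```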

Delivery Guarantees

  1. Kafka only guarantees at-least-once delivery. Exactly-once delivery typically requires two-phase commits and is not necessary for our applications. (Kafka guarantees at-least-once; exactly-once would need two-phase commit, which most applications don't require)
  2. In the case when a consumer process crashes without a clean shutdown, the consumer process that takes over those partitions owned by the failed consumer may get some duplicate messages that are after the last offset successfully committed to Zookeeper. (if a consumer dies abruptly, its latest offset update may never have reached Zookeeper, so the consumer taking over can see duplicates)
    • If an application cares about duplicates, it must add its own de-duplication logic, either using the offsets that we return to the consumer or some unique key within the message. (to avoid duplicates, implement your own de-duplication, e.g. with offsets or a unique key; see the sketch below)
  3. To avoid log corruption, Kafka stores a CRC for each message in the log. (a per-message CRC guards against log corruption)
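A minimal sketch of such consumer-side de-duplication, keyed on whatever unique id the messages carry; a real version would persist the set and bound its size:

```java
import java.util.HashSet;
import java.util.Set;

public class Deduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true the first time a key is seen, false for duplicates. */
    public boolean firstTime(String messageKey) {
        return seen.add(messageKey); // Set.add() returns false if already present
    }
}
```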
youge-dev commented 4 years ago

After a partition's leader goes down, how does Kafka pick a new leader from the replicas?

https://zhuanlan.zhihu.com/p/112536851