CS-6381 / FinalProject

9 stars 21 forks source link

Create kafka_overview.md #10

Closed maxwellmadewell closed 3 years ago

maxwellmadewell commented 3 years ago
Category Kafka
Overview Basics
  • Service bus to connect heterogeneous applications
  • Isolates producers from consumers
  • Producer Centric – does not block producers
Implementation Basics
  • 3 Main Components – Publisher, Cluster/Manager, Subscriber
  • Partitioning and stream of event packets containing data and transforming them into durable message brokers with cursors, supporting batch consumers that may be offline, or online consumers that want messages at low latency
  • Provides longer retention of messages (even after consumption) allowing re-consumption
  • Metadata stored at consumer level (not server level like other MQS)
  • No “master”, brokers treated as peers
  • Metadata of brokers maintained in Zookeeper(??later versions too??)
Use Cases
  • Distributed Pub/Sub Messaging system
  • Monitoring Data from Distributed Applications
  • Log aggregation services
  • Highly scalable
Features
  • Guarantees 0% data loss
  • 2 million writes / second
  • Optimized for data ingestion in real time
Scalability
  • Highly scalable
  • Adding consumers – doesn’t affect performance or require downtime
Failure Response
  • Resilient – creates replicas for failover
Message Consumption Model
  • PULL
Flexibility
  • General purpose Pub/Sub capability
Broker
  • Runs a cluster handling incoming high volume data streams
Message Delivery
  • Never Delivered
  • May be redelivered
  • Delivered Once
Message Ordering
  • Not known for message ordering
  • Need to utilize consistent hash exchange or sharding plugin
Storage Architecture
  • Records stored in log entry
  • Records stored as topics
Consumption Tracking
  • Does not track consumption
  • Consumers required to keep track of cursor
Record Retention Policy
  • Default – 7 days
  • Indefinite dependent on disk available
Record Replay
  • Yes
Data Transmission Protocols
  • Binary over TCP
  • Client initiates socket with queues, writes messages, waits for ack
Delivery Guarantees
  • at-most-once
  • at-least-once, and
  • also high-throughput exactly once semantics specifically aimed at end-to-end stream processing use cases
Replication
  • Synchronous – producer identifies lead from ZK and publishes – Written to log of lead replica and all followers of lead start pulling the message – order is ensured
  • Asynchronous – lead replica sends ACK to client w/o waiting for acknowledgement of receipt by other brokers
  • Resilient, durable messages with automatic recovery
Known For
  • Publisher Centric
    • High-Volume Publishers
      • On or Offline Consumers
        • Extremely high throughput with consistent latencies
          • Allows consumers to re-read messages
Other Features
  • Kafka Streams – helps to filter and transform data
  • Message Compression – GZIP/Snappy used to compress by producer
  • Message Batching
  • Must write separate publisher and subscribers – not embedded in application
Implementation Details
  • Backbone is message caching and storing on filesystem
  • Data is immediately written to OS kernal page
  • Caching/flushing of data to disk is configurable