Apache Kafka at the edge using Confluent operator

bitvijays commented 3 years ago

o/ Hope you are doing well. Thank you for creating confluent operator 👍

We are looking for deployment of Kafka at the edge using the operator and exploring different options how this could be achieved.

High - Level Edge Architecture

The architecture for Apache Kafka at the edge is usually like the following:

Image source

Left is the edge site and right is the cloud, usually performing replication using Mirror Maker2

Edge site (our case)

In our case, the edge site is usually 4GB Raspberry Pi installed in

Citizen's home
Street lamps on the road.

On the other hand, edge site can be

Restaurants
Retail Stores and others.

[Possible] different options for deploying Kafka at the edge

Use Cases and Architectures for Kafka at the Edge by Kai Waehner provides three different options

Case 1. One Kafka cluster is deployed at each site including additional Kafka components:

Case 2. Resilient Deployment at the Edge with 3+ Kafka Brokers

Case 3. Non-Resilient Deployment at the Edge with One Single Kafka Broker

The drawbacks with Non-resilient deployment are

No replication,
downtime in case of failure of the node or network,
risk of data loss

Implementing the Apache Kafka at the Edge using Confluent Operator

Assumption

The network connection between the cloud and edge is stable and pods are able to recover even if network is disrupted.

Case 1 and Case 2

If we implement a Kubernetes Control Plane at the Edge using k3s/k8s and then install confluent operator on the edge, this sounds good and implementable. This approach is good for edge sites which are large for example restaurants, retail stores. For example Chick-Fill-a Restaurants have this setup where they install K8S cluster at the edge on Intel NUC 8GB RAM, Quadcore processor, SSD and manage it using GitOps

Now, our use case is mainly gathering data from the end-devices such as Indoor Air Quality Monitors, Smart Plugs, Energy Monitoring and installing a Intel NUC with Kubernetes control plane at the edge is a overkill.

Case 3

As, we have Raspberry Pi 2/4 GB at the edge, we would like to install Non-Resilient Deployment at the Edge with One Single Kafka Broker. We understand that this has to be done with Replication factor =1, Topic partitions = 1.

Query

What are the possible implementation options for Case 3 using an operator and without using a k8s/k3s Kubernetes control plane at the edge.

We understand that, we can deploy a single kafka broker at the edge without operator, however, that is hard to manage. We have prototyped one Kubernetes pod with Kafka, MirrorMaker and Zookeeper, it works well, however security is missing, everything is plaintext and which is why, we want to manage Kafka cluster at the edge using a operator.

Any help is really appreciated.

amitkgupta commented 3 years ago

Hi @bitvijays, thanks for your thoughtful and interesting question! Let me put some thoughts out there for your consideration, would love to hear what you think.

Idea 1: No Kafka at the edge

As the article from @kaiwaehner that you referenced mentions, Kafka at the edge can help if you're doing any processing of the data at the edge, or you want to buffer at the edge in case of loss of network connection to the central Kafka cluster or generally to handle backpressure. It's not clear from your use case that you need to process at the edge, and sounds like you're assuming a stable connection to the central site. Would you consider having your applications and sensors send data directly to the central cluster or having a simpler stateless TCP proxy at the edge so application traffic passes through the proxy on to the central Kafka?

Idea 2: Kafka at the edge, no Kubernetes

Like Confluent Operator, one of the things the CP Ansible playbooks help with is security configuration management and is ideal for environments where Kubernetes is not an option.

Idea 3: Kafka and Kubernetes worker/agent at the edge, no Kubernetes server/control-plane at the edge

This option assumes you have a Kubernetes control plane somewhere, it just doesn't have to be at the edge. However you would have a worker/agent at the edge, and at minimum it will need to be able to connect to the servers/control plane wherever you happen to deploy that component. There may be other considerations to research as well when deploying Kubernetes in this architecture with agents/workers at the edge and the servers/control-plane in a remote location relative to the edge.

K3s documentation recommends 256MB+ and 5-10% CPU core as overhead for the agent at the edge, the rest can be utilized by your workloads (e.g. Kafka broker).

See here: https://rancher.com/docs/k3s/latest/en/installation/installation-requirements/resource-profiling/#k3s-agent

Idea 4: Kafka and full Kubernetes at the edge

To use Kubernetes, the minimum you need is a single Kubernetes server and you can have Kubernetes workloads run alongside that server. In other words, the Kubernetes server itself is also functioning as a worker/agent.

K3s documentation recommends 768-896MB+ and 10-20% CPU core as overhead for Kubernetes, the rest can be utilized by your workloads (e.g. Kafka broker). It sounded like you wanted to rule out the option of having a server at the edge but I wanted to make sure you were aware of these figures as they may actually fit within your constraints and in many ways this option is simpler than Idea 3.

See here: https://rancher.com/docs/k3s/latest/en/installation/installation-requirements/resource-profiling/#k3s-server-with-a-workload

bitvijays commented 3 years ago

Thank you @amitkgupta for the detailed reply. Really appreciated. Have provided more details about the requirements.

Idea 1: No Kafka at the edge

As the article from @kaiwaehner that you referenced mentions, Kafka at the edge can help if you're doing any processing of the data at the edge, or you want to buffer at the edge in case of loss of network connection to the central Kafka cluster or generally to handle backpressure. It's not clear from your use case that you need to process at the edge, and sounds like you're assuming a stable connection to the central site. Would you consider having your applications and sensors send data directly to the central cluster or having a simpler stateless TCP proxy at the edge so application traffic passes through the proxy on to the central Kafka?

Details on the use-case:

We want to collect sensor data (Indoor Air Quality, Smart Plugs, Energy monitoring) at the citizen house/ smart lamp-posts. Further, we want to do the following:

processing at the edge using Kafka streams. For instance a sensor value is higher than expected.
store and display the data to the end-user. For instance, the citizen might want to view the Air pollution inside the house using Grafana etc.
resiliency at the edge: A lot of times, the internet connection at the home goes down. In that case, kafka can store the data and later sync it with the cloud when internet is back.

Based on the use-case, we probably should have kafka at the edge.

Idea 2: Kafka at the edge, no Kubernetes

Like Confluent Operator, one of the things the CP Ansible playbooks help with is security configuration management and is ideal for environments where Kubernetes is not an option.

This sounds great. The problem is Ansible works using ssh (push mechanism). The devices (Raspberry Pi and other SBC) will be behind the user firewall (which we have no control of) and can't use to setup port forwarding to reach the SBC.

Is there a way to setup Apache Kafka securely using Puppet (pull mechanism) and can be behind the firewall?

We did a quick google search for Ansible behind the firewall and came across QBee - secure embedded linux iot device management and vpn. However, this requires payment for the service. Unfortunately, we can't use it as we want everything to be open-source.

Further, Having Kubernetes is preferred. For instance, We want to provide an option to the user to opt-out storing the data (about the sensors in home) at the cloud. In which case, we just deploy a pod with Kafka and Zookeeper rather than Kafka, zookeeper and Mirrormaker. I understand this could be possible using Puppet\Ansible.

Idea 3: Kafka and Kubernetes worker/agent at the edge, no Kubernetes server/control-plane at the edge

This option assumes you have a Kubernetes control plane somewhere, it just doesn't have to be at the edge. However you would have a worker/agent at the edge, and at minimum it will need to be able to connect to the servers/control plane wherever you happen to deploy that component. There may be other considerations to research as well when deploying Kubernetes in this architecture with agents/workers at the edge and the servers/control-plane in a remote location relative to the edge. K3s documentation recommends 256MB+ and 5-10% CPU core as overhead for the agent at the edge, the rest can be utilized by your workloads (e.g. Kafka broker). See here: https://rancher.com/docs/k3s/latest/en/installation/installation-requirements/resource-profiling/#k3s-agent

Yes, currently, we have something similar. We have Kubernetes control plane at the cloud and have worker/agent at the edge and are able to successfully deploy pods on the multiple edge machines (Raspberry Pi) at different house.

We want to use operator at the cloud to deploy kafka, zookeeper and mirrormaker securely. Further, as sensors would be sending the data using Kafka clients, it would be nice to expose kafka port (static port) as external so that sensors connected to the same network can access them. This would ensure that we have a static port on which kafka can be reached plus also helps in having standard configuration for the sensors.

Idea 4: Kafka and full Kubernetes at the edge

To use Kubernetes, the minimum you need is a single Kubernetes server and you can have Kubernetes workloads run alongside that server. In other words, the Kubernetes server itself is also functioning as a worker/agent. K3s documentation recommends 768-896MB+ and 10-20% CPU core as overhead for Kubernetes, the rest can be utilized by your workloads (e.g. Kafka broker). It sounded like you wanted to rule out the option of having a server at the edge but I wanted to make sure you were aware of these figures as they may actually fit within your constraints and in many ways this option is simpler than Idea 3.

Yes, this would be easier. However, we do want to avoid this. This is a kind of overkill for the usecase. For instance, few of the nodes at the edge are Raspberry Pi with 1 GB and 2 GB.

amitkgupta commented 3 years ago

Hey @bitvijays

Is there a way to setup Apache Kafka securely using Puppet (pull mechanism) and can be behind the firewall?

There may be, but I'm not familiar with this. Have you found anything online for Kafka + Puppet?

Also, have you considered ansible-pull: https://docs.ansible.com/ansible/latest/cli/ansible-pull.html

However, this requires payment for the service. Unfortunately, we can't use it as we want everything to be open-source.

Thanks for mentioning this. Just wanted to make sure you're aware, the Confluent Operator is also closed source, and requires a commercial license if using beyond a temporary trial period.

bitvijays commented 3 years ago

@amitkgupta Thank you for your reply.

Puppet does have a module for Kafka, however it does only install kafka. We don't see any security configuration on it.

Thank you for mentioning ansible pull, will have a look. Yes I am aware of the fact that Confluent Operator requires a commercial license.

On the other hand, we wanted to understand whether or not the operator can be used for automatic deployment at the edge with the below setup:

No Kubernetes server/control-plane at the edge.
Kafka and Kubernetes worker/agent at the edge.
Edge nodes are behind firewall with different network.

As you would probably agree, it is an interesting use-case and deployment with operator would be super-easy and preferred compared to ansible playbooks which would install packages at host node where as operator run the kafka setup in docker containers.

Please let us know If the operator is not the way forward or the scenarios/use-case is invalid for the operator, and the best possible option is to install kafka via ansible on edge host.

amitkgupta commented 3 years ago

Hey @bitvijays

On the other hand, we wanted to understand whether or not the operator can be used for automatic deployment at the edge with the below setup:

No Kubernetes server/control-plane at the edge.

Kafka and Kubernetes worker/agent at the edge.

Edge nodes are behind firewall with different network.

Yes, this architecture makes sense to me. You would of course need to test the implementation to validate it works for your hardware, resources, network performance, and application use cases, but this looks like a sound approach.

bitvijays commented 3 years ago

Thanks @amitkgupta 👍

In that case, would it be possible to guide us how we can create operator yaml files to achieve the following:

Run kafka broker with Replication factor =1, Topic partitions = 1 at the edge nodes. For instance if there are five nodes at five different houses (i.e different network and behind home firewall) , we need five kafka broker running independently with Replication factor =1, Topic partitions = 1 and possibly exposing the port externally (locally) so that sensors in the house (same local network containing the home router, sensors and the raspberry pi).
Run mirror maker also at the edge to sync the data with the cloud.

Maybe we can have yaml files similar to external-access-nodeport-deploy, external-access-static-host-based, external-access-static-port-based?

If there's something, we can help with, we would be happy to work on it.

amitkgupta commented 3 years ago

@bitvijays we hope to have some good lightweight examples for folks to use with Confluent Operator, enabling low-resource single-node deployments (for local quickstarts, but also a good starting point for edge deployments).

You can start with some of the YAML files already provided in this repo that most closely matches your target architecture. You'll to make some modifications, I think the following should give you a start:

---
kind: Zookeeper
spec:
  replicas: 1
  podTemplate:
    resources:
      requests:
        memory: ???Mi
  configOverrides:
    jvm: ["-Xms???M", "-Xmx???M"]
---
kind: Kafka
spec:
  replicas: 1
  podTemplate:
    resources:
      requests:
        memory: ???Mi
  configOverrides:
    jvm: ["-Xms???M", "-Xmx???M"]
    server:
      - "confluent.license.topic.replication.factor=1"
      - "confluent.tier.metadata.replication.factor=1"
      - "confluent.metadata.topic.replication.factor=1"
      - "confluent.balancer.topic.replication.factor=1"
      - "confluent.security.event.logger.exporter.kafka.topic.replicas=1"
      - "event.logger.exporter.kafka.topic.replicas=1"
      - "offsets.topic.replication.factor=1"

Replace ??? with whatever you want, maybe 512. There may be other changes necessary -- as mentioned we plan to work on and validate some of these examples. But if you can start with these pointers and get something working for your use case end to end, please do share!

bitvijays commented 3 years ago

@amitkgupta Hope you are doing well. Thank you for providing the initial yaml file. Just a quick question, how do we ensure that the zookeeper and kafka single node deployment runs on every node?. If we apply the above yaml file, it would deploy only on one node? Is there a way to ensure it running on some of all nodes, something like Kubernetes daemonset?

Have a good weekend :) Thank you again for your support. Much appreciated.

amitkgupta commented 3 years ago

Are you looking to have N distinct 1-broker Kafka clusters, or 1 single N-broker Kafka cluster?

amitkgupta commented 3 years ago

Same to you, have a good weekend!

bitvijays commented 3 years ago

@amitkgupta

At the end, what we want is

Data collected at the edge (would be used to display to the user, processing using kafka streams)
Synced at the cloud (Kafka) using mirror maker (running at the edge).

Whichever deployment, N distinct 1-broker Kafka clusters, or 1 single N-broker Kafka cluster, serves the purpose and easy to implement, we should be good with that.

Reading more about both N distinct 1-broker Kafka clusters, or 1 single N-broker Kafka cluster advantages, disadvantages.

Reading from Setting up a multi-broker cluster, it says For Kafka, a single broker is just a cluster of size one.
Reading from Configure Multi-Node Environment Kafka is a distributed system and data is read from and written to the partition leader. The leader can be on any broker in a cluster. When a client (producer or consumer) starts, it will request metadata about which broker is the leader for a partition. This request for metadata can come from any broker. The metadata that is returned will include the available endpoints for the lead broker of that partition. The client will use those endpoints to connect to the broker to read or write data as required.

In our case, the replication factor is 1, so there would be only one partition, so probably both deployment N distinct 1-broker Kafka clusters, or 1 single N-broker Kafka cluster should be fine? (We might be wrong, please do correct us.)

bitvijays commented 3 years ago

@amitkgupta Hope you are doing well. Did you got a chance to have a look at the above? Is there a possible way to deploy Kafka on multiple edge clients?

confluentinc / operator-earlyaccess