Try running multiple Kafka brokers and ZooKeeper servers with our producers and consumers (using another of the conduktor/kafka-stack-docker-compose configurations). Experiment with taking Kafka and ZooKeeper containers down.
How many containers can be down at the same time before our system stops working?
What happens to the Kafka system logs and the metrics that our binaries export? Did our alerts fire? If not, consider how they could be improved. Remember, the point of them is to tell us when something's wrong!
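Before taking containers down, it can help to write down what we expect to happen and then compare that with what we observe. The sketch below is only a starting hypothesis, not a description of our actual setup. It assumes ZooKeeper needs a strict majority of its ensemble to be running, and that our producers use `acks=all` so that writes depend on the topic's replication factor and `min.insync.replicas`. The function names are made up for illustration, and the numbers in `main` are examples, so check the real values in the compose file and topic configuration you are using.

```go
package main

import "fmt"

// zookeeperTolerance returns how many servers of a ZooKeeper ensemble can
// be down while a strict majority (the quorum) is still running.
func zookeeperTolerance(ensembleSize int) int {
	if ensembleSize < 1 {
		return 0
	}
	return (ensembleSize - 1) / 2
}

// kafkaWriteTolerance returns how many replicas of one partition can be
// down while a producer using acks=all can still write to it.
// Assumption: both values come from our topic/broker configuration.
func kafkaWriteTolerance(replicationFactor, minInSyncReplicas int) int {
	if minInSyncReplicas < 1 || replicationFactor < minInSyncReplicas {
		return 0
	}
	return replicationFactor - minInSyncReplicas
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("ZooKeeper ensemble of %d: tolerates %d down\n", n, zookeeperTolerance(n))
	}
	fmt.Printf("Kafka, replication factor 3, min.insync.replicas 2: tolerates %d down\n",
		kafkaWriteTolerance(3, 2))
}
```

If the experiment disagrees with these numbers, that is interesting in itself: it may point at a topic created with a lower replication factor than we expected, or at producers and consumers that do not reconnect cleanly.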
[ ] Dealing with long-running jobs and load (challenging)
What does our system do if someone submits a very long-running job? Try testing this with the sleep command.
If this is an issue for the stable operation of our system, or for running jobs in a timely fashion, what can we do about this?
If our system had problems, did our alerts fire?
How can we prevent our consumers from getting overloaded if compute-intensive jobs are submitted?
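Two common ways to approach these questions are a per-job timeout and a cap on how many jobs a consumer runs at once. The sketch below shows one possible shape for each, using only the Go standard library. It is not how our consumers currently work: `runWithTimeout`, `limiter`, `peakConcurrency` and `demoTimeout` are hypothetical names, and the timeout and limit values are purely illustrative. Note that `exec.CommandContext` kills the process it started, so a command that spawns its own children may need extra handling.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os/exec"
	"sync"
	"time"
)

// demoTimeout is an illustrative per-job limit, not a recommended value.
const demoTimeout = 200 * time.Millisecond

// runWithTimeout runs a command and kills it if it is still running when
// the timeout expires. timedOut reports whether the deadline was hit.
func runWithTimeout(timeout time.Duration, name string, args ...string) (timedOut bool, err error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err = exec.CommandContext(ctx, name, args...).Run()
	timedOut = err != nil && errors.Is(ctx.Err(), context.DeadlineExceeded)
	return timedOut, err
}

// limiter caps how many jobs run at the same time (a counting semaphore).
// It must be created with a capacity of at least 1.
type limiter chan struct{}

// do blocks until a slot is free, runs job, then releases the slot.
func (l limiter) do(job func()) {
	l <- struct{}{}
	defer func() { <-l }()
	job()
}

// peakConcurrency pushes `jobs` short jobs through a limiter of size `limit`
// and returns the highest number observed running at once.
func peakConcurrency(limit, jobs int) int {
	l := make(limiter, limit)
	var (
		mu            sync.Mutex
		running, peak int
		wg            sync.WaitGroup
	)
	for i := 0; i < jobs; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.do(func() {
				mu.Lock()
				running++
				if running > peak {
					peak = running
				}
				mu.Unlock()
				time.Sleep(10 * time.Millisecond) // stand-in for real work
				mu.Lock()
				running--
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	timedOut, err := runWithTimeout(demoTimeout, "sleep", "5")
	fmt.Println("timed out:", timedOut, "error:", err)
	fmt.Println("peak concurrent jobs with limit 2:", peakConcurrency(2, 8))
}
```

A timeout protects the consumer from a single job that never finishes, while the limiter protects it from many expensive jobs arriving together. Whichever we choose, a timed-out or rejected job is a good candidate for a metric, so that our alerts can tell us when it starts happening often.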
[ ] Security using Firecracker VMs (challenging)
An earlier note mentioned that simply exec-ing submitted commands in this way has security issues.
A better solution would be to use a [Firecracker VM](https://github.com/firecracker-microvm/firecracker/) to run the cron commands. Firecracker is an open-source virtualization technology that lets us start lightweight virtual machines very quickly and cheaply. It was developed at AWS to support services like AWS Lambda.
Here are some demos and examples of projects built with Firecracker:
https://stanislas.blog/2021/08/firecracker/
https://jvns.ca/blog/2021/01/23/firecracker--start-a-vm-in-less-than-a-second/
There is a [Firecracker SDK for Golang](https://github.com/firecracker-microvm/firecracker-go-sdk). If you have a significant amount of extra time available, updating the system to run commands in Firecracker VMs instead of exec-ing the commands provided would be a very good challenge.
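One way to prepare for this challenge, without committing to any Firecracker details yet, is to put the "run this command" step behind a small interface, so the consumer does not care where the command runs. The sketch below is an assumption about how that could look, not our existing code and not the Firecracker SDK's API: `Runner`, `ExecRunner`, `RecordingRunner` and `runJobs` are all hypothetical names, and running commands through `sh -c` is assumed for illustration. A Firecracker-backed implementation of `Runner` could later be swapped in for `ExecRunner`.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// Runner abstracts "run this cron command somewhere".
// Hypothetical interface: a Firecracker-backed type could implement it later.
type Runner interface {
	Run(ctx context.Context, command string) (output string, err error)
}

// ExecRunner runs the command directly on the host through a shell.
// This is the insecure approach we would like to replace.
type ExecRunner struct{}

func (ExecRunner) Run(ctx context.Context, command string) (string, error) {
	out, err := exec.CommandContext(ctx, "sh", "-c", command).CombinedOutput()
	return string(out), err
}

// RecordingRunner is a fake that records commands instead of running them.
// It is useful for testing the consumer without executing anything.
type RecordingRunner struct {
	Commands []string
}

func (r *RecordingRunner) Run(_ context.Context, command string) (string, error) {
	r.Commands = append(r.Commands, command)
	return "", nil
}

// runJobs runs each command through r and returns how many of them failed.
// It depends only on the Runner interface, not on how commands are executed.
func runJobs(r Runner, commands ...string) int {
	ctx := context.Background()
	failed := 0
	for _, command := range commands {
		if out, err := r.Run(ctx, command); err != nil {
			fmt.Printf("job %q failed: %v (output: %q)\n", command, err, out)
			failed++
		}
	}
	return failed
}

func main() {
	fmt.Println("failed jobs:", runJobs(ExecRunner{}, "echo hello", "exit 1"))
}
```

With this split, the work of the challenge is concentrated in one new type that starts a VM, runs the command inside it, and returns the output, while the rest of the consumer stays unchanged.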