I just passed 50,000 subscribers on my YouTube channel, so I thought I'd find a fitting way to celebrate.
For 10,000 subscribers, I ran 10,000 K8s Pods on an Amazon EKS cluster; you can find a video detailing that here: 10,000 Pods for 10k Subscribers.
So for 50,000, I did a few calculations and found that I'd have to pay a few hundred bucks (at least!) just for the time on an EKS cluster with enough workers to run 50,000 Pods simultaneously. So that was right out.
Instead, I'm building this Ansible playbook to automate the process of running 50,000 individual Jobs on a Kubernetes cluster!
And of course, I made a video detailing the process and thanking my subscribers: 50,000 Kubernetes Jobs for 50k Subscribers.
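For a sense of what's actually being created, here's a minimal sketch of the kind of Job manifest the playbook stamps out 50,000 times. The name, image, and command here are illustrative assumptions (not copied from the repo), though the `type: 50k` label matches the selector used for monitoring later on:

```yaml
# Hypothetical single Job (name, image, and command are assumptions).
apiVersion: batch/v1
kind: Job
metadata:
  name: job-00001
  labels:
    type: 50k  # the label the monitoring command below selects on
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: job
          image: busybox:1.35
          command: ['/bin/sh', '-c', 'echo thank you, subscribers!']
```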
Make sure you have Ansible and Kind installed, then run the following:

```
kind create cluster
pip3 install ansible "molecule[docker]" yamllint ansible-lint openshift
ansible-galaxy install -r requirements.yml
molecule converge
```

When you're finished, run `kind delete cluster`.
I guess this is a great opportunity to thank this project's sponsor, Linode; they not only sponsored my '50,000 Kubernetes Jobs for 50,000 Subscribers' video, they were also very kind in a couple of other ways. Seriously! Go try out Linode using my link, and you can take some new infrastructure for a spin for free!
Anyways, since I knew I'd build this cluster on Linode, I went ahead and did the following (the `K8S_AUTH_KUBECONFIG` environment variable points Ansible's Kubernetes modules at the Linode cluster's kubeconfig):

```
export K8S_AUTH_KUBECONFIG=~/.kube/linode.yaml
molecule converge
```
That runs with all the defaults in the playbook. If you want to override them, the easiest way is to run `ansible-playbook` directly and pass extra variables:

```
ansible-playbook main.yml --extra-vars="{'batch_size':500,'total_count':50000}"
```
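Those variables (plus `inflight_cleanup`, covered below) are presumably just play-level vars in `main.yml`; a sketch of what the defaults block might look like, with the variable names taken from the commands in this README and the values assumed:

```yaml
# Hypothetical defaults in main.yml (values assumed, names from --extra-vars).
vars:
  batch_size: 500         # Jobs created per batch
  total_count: 50000      # total Jobs to run across all batches
  inflight_cleanup: true  # delete each batch's Jobs before starting the next
```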
Check on how many Jobs have completed by monitoring:

```
kubectl get jobs -l type=50k --field-selector status.successful=1
```
Note: For efficiency's sake, each batch of Jobs is deleted after it successfully runs (otherwise there seems to be a hard limit on how many Jobs/Pods can remain present on the cluster before the scheduler grinds to a halt). If you want to leave all Jobs in place, add the extra variable `'inflight_cleanup':false`.
Because I encountered issues when running more than 3,000-5,000 Jobs in a given cluster, I set up the playbook to run a batch of Jobs, then delete all those Jobs (and their orphaned Pods), then move on to the next batch. This allowed the playbook to deploy all 50,000 Jobs.
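As a rough illustration of that create-wait-delete cycle (the task layout, template name, and wait logic here are my assumptions, not the playbook's actual tasks), one batch might look something like this using the kubernetes.core modules:

```yaml
# Hypothetical batch cycle (sketch only; not the playbook's actual tasks).
- name: Create one batch of Jobs.
  kubernetes.core.k8s:
    state: present
    definition: "{{ lookup('template', 'job.yml.j2') }}"  # hypothetical template
  loop: "{{ range(0, batch_size | int) | list }}"
  loop_control:
    loop_var: job_index

- name: Wait for every Job in the batch to complete.
  kubernetes.core.k8s_info:
    kind: Job
    label_selectors:
      - type=50k
    field_selectors:
      - status.successful=1
  register: finished
  until: finished.resources | length >= (batch_size | int)
  retries: 120
  delay: 10

- name: Find the batch's Jobs so they can be cleaned up.
  kubernetes.core.k8s_info:
    kind: Job
    label_selectors:
      - type=50k
  register: batch_jobs
  when: inflight_cleanup | bool

- name: Delete the batch's Jobs before moving on.
  kubernetes.core.k8s:
    state: absent
    kind: Job
    api_version: batch/v1
    name: "{{ item.metadata.name }}"
    namespace: "{{ item.metadata.namespace }}"
  loop: "{{ batch_jobs.resources | default([]) }}"
  when: inflight_cleanup | bool
```

The real playbook also has to clean up the orphaned Pods the deleted Jobs leave behind, since deleting a Job through the API doesn't necessarily cascade to its Pods.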
But you can bypass that 'inflight cleanup' of each batch by setting `inflight_cleanup: false` in the playbook extra vars, for example:

```
ansible-playbook main.yml --extra-vars="{'batch_size':500,'total_count':50000,'inflight_cleanup':false}"
```
Note that this configuration has not yet been successfully used to deploy more than 4,000-5,000 Jobs on a single cluster, but it does allow you to dump a LOT of Jobs into a cluster and see how far it can get.
This project was created by Jeff Geerling, author of Ansible for DevOps and Ansible for Kubernetes, in support of a video celebrating 50,000 subscribers on his YouTube channel.