geerlingguy / 50k-k8s-jobs

50,000 Kubernetes Jobs for 50,000 Subscribers
https://www.youtube.com/watch?v=O1iEBzY7-ok
MIT License
45 stars, 5 forks

Build out a production cluster that is capable of running thousands of jobs #2

Closed geerlingguy closed 4 years ago

geerlingguy commented 4 years ago

As the title says. I believe I will be trying out Linode's offering for this task!

geerlingguy commented 4 years ago

So I'm doing it... But I've hit a snag:

$ kubectl get jobs --field-selector status.successful=1 | wc -l
    2540

$ kubectl get pods | wc -l
    3332

It seems that I can only get that many Pods and then the cluster just... stops.
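A note on reading these counts: `wc -l` also counts the header row that `kubectl get` prints, so each number above is one higher than the real object count. A minimal sketch that skips the header, assuming everything lives in the current namespace (the function names are made up for illustration):

```shell
#!/bin/sh
# Count Pods without the header row that `kubectl get` prints by default.
count_pods() {
  kubectl get pods --no-headers | wc -l | tr -d ' '
}

# Count only Pods whose phase is Succeeded, i.e. Pods that finished their work.
count_succeeded_pods() {
  kubectl get pods --no-headers --field-selector=status.phase=Succeeded | wc -l | tr -d ' '
}
```

Comparing the two numbers shows roughly how many Pods are still pending or running when the cluster appears to stall.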

geerlingguy commented 4 years ago

Trying again with a cluster of just 4 nodes, dedicated CPU. I can't see any issues with the nodes themselves (they all say they have plenty of capacity), so I'm wondering if maybe the control plane is hitting some hard limit. I can't look at kubelet on the nodes themselves :-/

geerlingguy commented 4 years ago

I think I might try a different approach. Instead of dumping thousands of job definitions on the cluster at once and letting Kubernetes try to sort them all out in parallel (which is probably what's killing it), I'm going to submit one batch, wait for all of those Jobs to complete, then submit the next batch, and so on.

I'm starting with batches of 10 Jobs. It's excruciatingly slow, but so far it seems to be letting every Job complete: up to 3,200 Jobs at this point.

geerlingguy commented 4 years ago

This is too excruciatingly slow... I'm going to try setting ttlSecondsAfterFinished: 60 so Jobs get cleaned up (and hopefully the Pod associated with the Job)... and we'll see if that helps things.
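For reference, a rough sketch of where that field goes in a Job manifest. The Job name, image, and command are placeholders and not taken from this project; only `ttlSecondsAfterFinished: 60` comes from the comment above. When the TTL mechanism is available, Kubernetes deletes the finished Job together with its Pods.

```shell
#!/bin/sh
# Print a Job manifest that asks Kubernetes to delete the Job (and its Pods)
# 60 seconds after it finishes. This only has an effect when the
# TTL-after-finished feature is enabled on the cluster.
# The image and command below are placeholders for illustration.
job_manifest() {
  name="$1"
  cat <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: ${name}
spec:
  ttlSecondsAfterFinished: 60
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: hello
          image: busybox
          command: ["echo", "hello from ${name}"]
EOF
}

# Usage (needs cluster access): job_manifest job-1 | kubectl apply -f -
```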

geerlingguy commented 4 years ago

Hmm, never mind, it looks like the TTLAfterFinished feature gate may not be enabled. I'll ask Linode about that. (Edit: support ticket - requires login).

geerlingguy commented 4 years ago

We're getting better, but far from success:

$ kubectl get pods | wc -l
    4601

geerlingguy commented 4 years ago

So, new method that seems to actually be working:

  1. Build batch of jobs.
  2. Wait for jobs to complete.
  3. Delete batch of jobs.
  4. Move on to next batch.
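
The four steps above could be sketched with plain kubectl roughly like this. It is not the playbook used in this repository: it assumes every Job in a batch carries a `batch=<n>` label, and `make_batch` is a hypothetical helper that prints the manifests for one batch on stdout.

```shell
#!/bin/sh
# Run Jobs in batches: create a batch, wait for it, delete it, move on.
# Assumes each Job is labelled batch=<n>; make_batch is a hypothetical
# helper supplied by the caller that prints one batch of manifests.
run_batches() {
  total_batches="$1"
  batch=1
  while [ "$batch" -le "$total_batches" ]; do
    # 1. Build batch of jobs.
    make_batch "$batch" | kubectl apply -f - || return 1
    # 2. Wait for jobs to complete (gives up after 10 minutes).
    kubectl wait --for=condition=complete job -l "batch=$batch" --timeout=600s || return 1
    # 3. Delete batch of jobs.
    kubectl delete job -l "batch=$batch" || return 1
    # 4. Move on to next batch.
    batch=$((batch + 1))
  done
}
```

Stopping on the first failed step keeps a stuck batch from piling more work onto the cluster.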

It seems the cluster is nice and speedy throughout now, instead of eventually slowing waaaay down. I'll see how many jobs I can get through. So far the pace is 3500 jobs per 30 minutes (50,000 in ~7 hours).

geerlingguy commented 4 years ago

Spoke too soon... Things are getting really, really slow around job number 5000. Right as I was gaining some confidence!

I'll let it go another 30 minutes to an hour, and we'll see what happens.

geerlingguy commented 4 years ago

So... it looks like when I delete a Job, the Job deletes just fine, but the Pod is not getting cleaned up:

$ kubectl get jobs --all-namespaces | wc -l
     101

$ kubectl get pods --all-namespaces | wc -l
    4929

(Confirmed by checking some of the Pods and seeing they still had all their data and logs associated.)

geerlingguy commented 4 years ago

Trying to figure out why the Pods aren't deleted when the Jobs they are owned by are deleted:

Controlled By: Job/739 — so the Pod definitely has an owner reference.

Reading up on the Kubernetes Garbage Collection docs, it seems Pods end up orphaned when they're missing ownerReference metadata, but that doesn't look like the problem here, since the Pod does list its owner Job. I looked at this issue (https://github.com/kubernetes/kubernetes/issues/71975), and it seems like something is just not getting set correctly in my case... however I've already spent enough time getting this thing moving, so I'm going to just let it run through with my new manual fix (deleting orphaned Pods in each batch), and see what happens :)
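
A few kubectl helpers that may be useful for this kind of debugging, as a sketch only. One possible explanation, offered as an assumption and not confirmed anywhere in this thread, is that a Job deleted through the API without an explicit propagation policy can leave its Pods behind, whereas `kubectl delete` cascades by default. The function names are invented for illustration.

```shell
#!/bin/sh
# Print kind/name for each owner recorded on a Pod (no output means no owner).
pod_owners() {
  kubectl get pod "$1" -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}'
}

# Delete a Job and explicitly ask for its Pods to be removed as well.
# --cascade=foreground needs kubectl 1.20 or newer; older clients used --cascade=true.
delete_job_with_pods() {
  kubectl delete job "$1" --cascade=foreground
}

# Fallback cleanup: remove Pods that have already finished successfully.
delete_finished_pods() {
  kubectl delete pods --field-selector=status.phase=Succeeded
}
```

The last helper corresponds to the manual fix described above: sweep up leftover Pods after each batch instead of relying on garbage collection.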

geerlingguy commented 4 years ago

Yay, so that worked — 01:35:48 for 10,000 jobs:

PLAY RECAP **********************************************************************************************************************************************************************************
127.0.0.1                  : ok=141  changed=60   unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   

Friday 09 October 2020  16:12:15 -0500 (0:00:00.027)       1:35:48.591 ******** 
=============================================================================== 
Wait for jobs to be removed. ------------------------------------------------------------------------------------------------------------------------------------------------------- 243.10s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 233.79s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 156.13s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 147.02s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 143.96s
Wait for jobs to be removed. ------------------------------------------------------------------------------------------------------------------------------------------------------- 130.47s
Wait for jobs to be removed. ------------------------------------------------------------------------------------------------------------------------------------------------------- 117.79s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 117.30s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.62s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.46s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.42s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.38s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.36s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.35s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.32s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.22s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.06s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.01s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 115.61s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 115.11s
Playbook run took 0 days, 1 hours, 35 minutes, 48 seconds

Extrapolated, 475 minutes (or ~8 hours) total for 50,000 Jobs. Hopefully. Going to try doing this on my wife's laptop tonight!
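
The extrapolation is simple linear scaling, sketched below with integer shell arithmetic. Using the full 1:35:48 instead of a rounded 95 minutes gives 479 minutes, which is still about 8 hours. It assumes throughput stays constant over the whole run, which earlier attempts in this thread show is not guaranteed.

```shell
#!/bin/sh
# Extrapolate a measured run to a larger job count (whole minutes, rounded down).
# Arguments: elapsed seconds, jobs completed in that time, target job count.
extrapolate_minutes() {
  echo $(( $1 * $3 / $2 / 60 ))
}

# 1:35:48 for 10,000 jobs, scaled to 50,000 jobs.
elapsed=$((1 * 3600 + 35 * 60 + 48))
extrapolate_minutes "$elapsed" 10000 50000   # prints 479
```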

geerlingguy commented 4 years ago

Running now...

geerlingguy commented 4 years ago

Just passed 5,000 Jobs. Going to let it run through dinner / bedtime routine and come down and check on it in a few hours!

geerlingguy commented 4 years ago

Done, a tiny bit over 8 hours later.