Closed geerlingguy closed 4 years ago
So I'm doing it... But I've hit a snag:
$ kubectl get jobs --field-selector status.successful=1 | wc -l
2540
$ kubectl get pods | wc -l
3332
It seems that I can only get that many Pods and then the cluster just... stops.
Trying again with a cluster of just 4 nodes, dedicated CPU. I can't see any issues with the nodes themselves (they all say they have plenty of capacity), so I'm wondering if maybe the control plane is hitting some hard limit. I can't look at kubelet on the nodes themselves :-/
I think I might try a different approach. Instead of dumping thousands of job definitions on the cluster at once and letting Kubernetes try to sort it all out in parallel (which is probably what's killing it), I'm going to submit a batch, wait for all those Jobs to complete, then submit another batch, and so on.
I'm starting with 10 job batches. It's excruciatingly slow, but it seems like it's actually allowing all jobs to complete so far—up to 3200 jobs at this point.
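For anyone following along, the batch-then-wait idea could be sketched roughly like this in Python. This is not the playbook I'm actually running; `submit` and `wait_until_complete` are hypothetical hooks (they might wrap `kubectl apply` and a polling loop), and the batch size is just a parameter.

```python
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    job_names: Sequence[str],
    size: int,
    submit: Callable[[Sequence[str]], None],
    wait_until_complete: Callable[[Sequence[str]], None],
) -> int:
    """Submit one batch, block until it finishes, then move on to the next.

    `submit` and `wait_until_complete` are hypothetical hooks supplied by
    the caller; nothing here talks to a cluster directly.
    Returns the number of batches processed.
    """
    count = 0
    for batch in batches(job_names, size):
        submit(batch)               # hand this batch to the cluster
        wait_until_complete(batch)  # do not start the next batch until done
        count += 1
    return count
```

The point of the design is that the cluster never has more than one batch of Jobs in flight, at the cost of idle time while the slowest Job in each batch finishes.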
This is too excruciatingly slow... I'm going to try setting ttlSecondsAfterFinished: 60 so Jobs get cleaned up (and hopefully the Pod associated with the Job)... and we'll see if that helps things.
Hmm, never mind, it looks like the TTLAfterFinished feature gate may not be enabled. I'll ask Linode about that. (Edit: support ticket - requires login).
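For reference, here is roughly where ttlSecondsAfterFinished sits in a Job definition, written as a small Python helper that builds the manifest as a dict (kubectl accepts JSON as well as YAML). The job name, image, and command are placeholders, not what I'm actually running, and the field only does anything if the cluster has the TTL feature turned on.

```python
import json
from typing import List


def job_manifest(name: str, image: str, command: List[str], ttl_seconds: int) -> dict:
    """Build a minimal batch/v1 Job manifest with a cleanup TTL."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # Ask the cluster to delete the finished Job after this many seconds.
            "ttlSecondsAfterFinished": ttl_seconds,
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "command": command}
                    ],
                    "restartPolicy": "Never",
                }
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(job_manifest("job-1", "busybox", ["echo", "hello"], 60), indent=2))
```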
We're getting better, but far from success:
$ kubectl get pods | wc -l
4601
So, new method that seems to actually be working:
It seems the cluster is nice and speedy throughout now, instead of eventually slowing waaaay down. I'll see how many jobs I can get through. So far the pace is 3500 jobs per 30 minutes (50,000 in ~7 hours).
Spoke too soon... Things are getting really, really slow around job number 5000. Right as I was gaining some confidence!
I'll let it go another 30 minutes to an hour, and we'll see what happens.
So... it looks like when I delete a Job, the Job deletes just fine, but the Pod is not getting cleaned up:
$ kubectl get jobs --all-namespaces | wc -l
101
$ kubectl get pods --all-namespaces | wc -l
4929
(Confirmed by checking some of the Pods and seeing they still had all their data and logs associated.)
Trying to figure out why the Pods aren't deleted when the Jobs they are owned by are deleted:
Controlled By: Job/739
— so the Pod definitely has an owner reference.
Reading up on the Kubernetes Garbage Collection docs, it seems like the Pods are missing ownerReference metadata, which means they end up orphaned. I looked at this issue (https://github.com/kubernetes/kubernetes/issues/71975), and it seems like something is just not getting set correctly in my case... however I've already spent enough time getting this thing moving, so I'm going to just let it run through with my new manual fix (deleting orphaned jobs in each batch), and see what happens :)
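If anyone wants to check for the same thing, a rough way to spot leftover Pods is to compare each Pod's ownerReferences against the Jobs that still exist. The sketch below assumes the Pod dicts have the shape of the items in `kubectl get pods -o json` output; the function names are made up, and it matches owners by name for simplicity, whereas the garbage collector works with UIDs, so a real check would probably compare those instead.

```python
from typing import Iterable, List, Mapping, Set


def job_owner_names(pod: Mapping) -> List[str]:
    """Return the names of all Job owners listed in a Pod's metadata."""
    refs = pod.get("metadata", {}).get("ownerReferences") or []
    return [ref["name"] for ref in refs if ref.get("kind") == "Job"]


def orphaned_pods(pods: Iterable[Mapping], live_jobs: Set[str]) -> List[str]:
    """Return names of Pods that have no Job owner among `live_jobs`.

    This covers both Pods with no ownerReferences at all and Pods whose
    owning Job has already been deleted.
    """
    orphans = []
    for pod in pods:
        owners = job_owner_names(pod)
        if not any(name in live_jobs for name in owners):
            orphans.append(pod["metadata"]["name"])
    return orphans
```

The returned names could then be passed to a delete step at the end of each batch.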
Yay, so that worked — 01:35:48 for 10,000 jobs:
PLAY RECAP **********************************************************************************************************************************************************************************
127.0.0.1 : ok=141 changed=60 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
Friday 09 October 2020 16:12:15 -0500 (0:00:00.027) 1:35:48.591 ********
===============================================================================
Wait for jobs to be removed. ------------------------------------------------------------------------------------------------------------------------------------------------------- 243.10s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 233.79s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 156.13s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 147.02s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 143.96s
Wait for jobs to be removed. ------------------------------------------------------------------------------------------------------------------------------------------------------- 130.47s
Wait for jobs to be removed. ------------------------------------------------------------------------------------------------------------------------------------------------------- 117.79s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 117.30s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.62s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.46s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.42s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.38s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.36s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.35s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.32s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.22s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.06s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 116.01s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 115.61s
Wait until 500 jobs are successful. ------------------------------------------------------------------------------------------------------------------------------------------------ 115.11s
Playbook run took 0 days, 1 hours, 35 minutes, 48 seconds
Extrapolated, 475 minutes (or ~8 hours) total for 50,000 Jobs. Hopefully. Going to try doing this on my wife's laptop tonight!
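The extrapolation is plain linear scaling, which assumes the per-batch time stays flat for the whole run (the recap above suggests it does once things settle at around 116 seconds per 500 jobs). A quick Python check, using the 1:35:48 wall time with the fractional seconds dropped, lands within a few minutes of the estimate above:

```python
from datetime import timedelta


def extrapolate(elapsed: timedelta, jobs_done: int, jobs_total: int) -> timedelta:
    """Scale an observed run time linearly to a larger number of jobs."""
    if jobs_done <= 0:
        raise ValueError("jobs_done must be positive")
    return elapsed * jobs_total / jobs_done


if __name__ == "__main__":
    observed = timedelta(hours=1, minutes=35, seconds=48)
    estimate = extrapolate(observed, 10_000, 50_000)
    print(estimate)                       # 7:59:00
    print(estimate.total_seconds() / 60)  # 479.0
```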
Running now...
Just passed 5,000 Jobs. Going to let it run through dinner / bedtime routine and come down and check on it in a few hours!
Done, a tiny bit over 8 hours later.
As the title says. I believe I will be trying out Linode's offering for this task!