geerlingguy / 50k-k8s-jobs

50,000 Kubernetes Jobs for 50,000 Subscribers
https://www.youtube.com/watch?v=O1iEBzY7-ok
MIT License

Figure out why Kubernetes can't seem to run 50k Jobs directly #4

Open · geerlingguy opened this issue 3 years ago

geerlingguy commented 3 years ago

I tried it two different ways:

  1. Dump 50,000 Job definitions on the API and let it sort things out.
  2. Drop 25, 100, or 500 Jobs on the API in a batch, wait for the API to report all of those Jobs 'Succeeded', then move on to the next batch (roughly sketched below).
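
For reference, the batched approach looks roughly like the sketch below. The job-50k names, the busybox image, and the use of kubectl wait are illustrative assumptions, not the exact manifests or polling used for the video; the type=50k label matches the one in the watch command further down.

```bash
#!/bin/bash
# Rough sketch of the batched approach: submit BATCH_SIZE Jobs, wait for
# every Job to report Complete, then move on to the next batch.
# The names, image, and label here are illustrative assumptions.
TOTAL=50000
BATCH_SIZE=100

for ((start=0; start<TOTAL; start+=BATCH_SIZE)); do
  for ((i=start; i<start+BATCH_SIZE; i++)); do
    kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: job-50k-${i}
  labels:
    type: "50k"
spec:
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo hello from job ${i}"]
      restartPolicy: Never
EOF
  done

  # Wait for everything with the label to be Complete (Jobs from earlier
  # batches are already Complete, so they return immediately).
  kubectl wait --for=condition=complete --timeout=600s job -l type=50k
done
```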

In both cases, the Linode clusters seemed to hit some sort of wall around 3,000-5,000 Jobs. My local cluster died (see #3) just under 3,000 Jobs.

If I create a batch, delete that batch along with all the orphaned Pods it left behind (for some reason deletion propagation wasn't happening in either 1.18 or 1.19; it didn't seem like any owner references were set on the Jobs' Pods), and then move on, I can get up to 50,000 Jobs (and likely beyond).
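
Roughly, the cleanup between batches looks like this (a sketch, not the exact script; the type=50k label is the same one used in the watch command below, and job-name is the label the Job controller puts on the Pods it creates):

```bash
#!/bin/bash
# Sketch: tear down a finished batch before submitting the next one.
# Normally deleting a Job cascades to its Pods via ownerReferences,
# but since that wasn't happening here, the orphaned Pods get removed
# explicitly by label.

# Delete the Jobs in the batch.
kubectl delete jobs -l type=50k

# Remove any Pods the Jobs left behind. Every Pod created by a Job
# carries a job-name label, so this only touches Job-owned Pods.
kubectl delete pods -l job-name --field-selector=status.phase=Succeeded
```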

So my question is this: why does it seem like the scheduler starts to fall over at such a low number of Jobs? Surely there are clusters out there where people don't garbage collect Jobs and there are many, many thousands of Jobs, right? (And I'm not talking about CronJobs here, just Jobs).
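
(Somewhat related: Kubernetes does have spec.ttlSecondsAfterFinished for automatically garbage-collecting finished Jobs, but on 1.18/1.19 it's still behind the TTLAfterFinished feature gate, so it's not something you can count on with a managed cluster. An illustrative example:)

```bash
# Illustrative Job that cleans itself up 5 minutes after finishing
# (requires the TTLAfterFinished feature gate on 1.18/1.19).
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-example
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo done"]
      restartPolicy: Never
EOF
```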

I think I might open an issue in the K8s repo and see if there's any more official light to shine on this, since the docs are completely silent on any warnings about trying to run thousands of Jobs.

geerlingguy commented 3 years ago

I just created the following issue upstream, to see if anyone has further ideas to help figure out why the initial idea didn't work: https://github.com/kubernetes/kubernetes/issues/95492

geerlingguy commented 3 years ago

I'm going to run another cluster and get a time series graph of how long it takes per Job:

watch -n5 "kubectl get jobs -l type=50k --field-selector status.successful=1 | wc -l | awk -v date=\", \$(date)\" '{print \$1, date}' >> result.csv"
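
(One caveat with that command: kubectl get prints a header row, so wc -l counts one extra line. Adding --no-headers gives the exact Job count:)

watch -n5 "kubectl get jobs -l type=50k --field-selector status.successful=1 --no-headers | wc -l | awk -v date=\", \$(date)\" '{print \$1, date}' >> result.csv"
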
geerlingguy commented 3 years ago

I also asked Linode via support ticket here: https://cloud.linode.com/support/tickets/14647575 (have to be logged in as me to view ;).

markrity commented 3 years ago

@geerlingguy that's awesome that you're keeping track of this and handing your findings over to the Kubernetes maintainers. It would be great to see a screenshot of what Linode answered to your question (I guess the final conclusion after the investigation, or something).

geerlingguy commented 3 years ago

@markrity - Don't worry, I'll keep things updated here :)

geerlingguy commented 3 years ago

So to sum up some of the things I've learned:

Here are some of the graphs and the raw CSV data used to build them. The data was dumped to CSV with the same command as before:

watch -n5 "kubectl get jobs -l type=50k --field-selector status.successful=1 | wc -l | awk -v date=\", \$(date)\" '{print \$1, date}' >> result.csv"

Linode (defaults)

[graph: jobs-linode-default]

2 GB of RAM limit for Docker for Mac

With scheduler/controller --leader-elect=true

[graph: jobs-local-2gb-true]

With scheduler/controller --leader-elect=false

[graph: jobs-local-2gb-false]

8 GB of RAM limit for Docker for Mac

With scheduler/controller --leader-elect=true

[graph: jobs-local-8gb-true]

With scheduler/controller --leader-elect=false

[graph: jobs-local-8gb-false]

And here's a zip file containing all the CSV files:

[attachment: CSV-data.zip]

geerlingguy commented 3 years ago

The fact that the Linode and 2 GB RAM graphs on my Mac line up so perfectly, with the inflection point around 3,000 Jobs, makes me strongly suspect the Linode master has 2 GB of RAM by default.
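
(For the local Docker for Mac cluster that's easy to confirm from the node's reported capacity, e.g. with something like the command below; Linode's managed control plane isn't exposed as a node, which is why I had to ask support instead.)

kubectl get nodes -o custom-columns=NAME:.metadata.name,MEMORY:.status.capacity.memory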