geerlingguy / 50k-k8s-jobs

50,000 Kubernetes Jobs for 50,000 Subscribers
https://www.youtube.com/watch?v=O1iEBzY7-ok
MIT License

Figure out why Kubernetes can't seem to run 50k Jobs directly #4

Open · geerlingguy opened this issue 3 years ago

geerlingguy commented 3 years ago

I tried it two different ways:

  1. Dump 50,000 Job definitions on the API and let it sort things out.
  2. Drop 25, 100, or 500 Jobs on the API in a batch, wait for the API to report all of those Jobs 'Succeeded', then move on to the next batch (roughly sketched below).
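
For reference, the batched approach looks roughly like the sketch below. The job-50k names, the busybox image, and the use of kubectl wait are illustrative assumptions, not the exact manifests or polling used for the video; the type=50k label matches the one in the watch command further down.

```bash
#!/bin/bash
# Rough sketch of the batched approach: submit BATCH_SIZE Jobs, wait for
# every Job to report Complete, then move on to the next batch.
# The names, image, and label here are illustrative assumptions.
TOTAL=50000
BATCH_SIZE=100

for ((start=0; start<TOTAL; start+=BATCH_SIZE)); do
  for ((i=start; i<start+BATCH_SIZE; i++)); do
    kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: job-50k-${i}
  labels:
    type: "50k"
spec:
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo hello from job ${i}"]
      restartPolicy: Never
EOF
  done

  # Wait for everything with the label to be Complete (Jobs from earlier
  # batches are already Complete, so they return immediately).
  kubectl wait --for=condition=complete --timeout=600s job -l type=50k
done
```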

In both cases, the Linode clusters seemed to hit some sort of wall around 3,000-5,000 Jobs. My local cluster died (see #3) just under 3,000 Jobs.

If I create a batch, delete that batch along with all the orphaned Pods it left behind (for some reason deletion propagation wasn't happening in either 1.18 or 1.19; it didn't seem like any owner references were set on the Jobs' Pods), and then move on, I can get up to 50,000 Jobs (and likely beyond).
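
Roughly, the cleanup between batches looks like this (a sketch, not the exact script; the type=50k label is the same one used in the watch command below, and job-name is the label the Job controller puts on the Pods it creates):

```bash
#!/bin/bash
# Sketch: tear down a finished batch before submitting the next one.
# Normally deleting a Job cascades to its Pods via ownerReferences,
# but since that wasn't happening here, the orphaned Pods get removed
# explicitly by label.

# Delete the Jobs in the batch.
kubectl delete jobs -l type=50k

# Remove any Pods the Jobs left behind. Every Pod created by a Job
# carries a job-name label, so this only touches Job-owned Pods.
kubectl delete pods -l job-name --field-selector=status.phase=Succeeded
```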

So my question is this: why does it seem like the scheduler starts to fall over at such a low number of Jobs? Surely there are clusters out there where people don't garbage collect Jobs and there are many, many thousands of Jobs, right? (And I'm not talking about CronJobs here, just Jobs).
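
(Somewhat related: Kubernetes does have spec.ttlSecondsAfterFinished for automatically garbage-collecting finished Jobs, but on 1.18/1.19 it's still behind the TTLAfterFinished feature gate, so it's not something you can count on with a managed cluster. An illustrative example:)

```bash
# Illustrative Job that cleans itself up 5 minutes after finishing
# (requires the TTLAfterFinished feature gate on 1.18/1.19).
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-example
spec:
  ttlSecondsAfterFinished: 300
  template:
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "echo done"]
      restartPolicy: Never
EOF
```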

I think I might open an issue in the K8s repo and see if there's any more official light to shine on this, since the docs are completely silent on any warnings about trying to run thousands of Jobs.

geerlingguy commented 3 years ago

I just created the following issue upstream, to see if anyone has further ideas to help figure out why the initial idea didn't work: https://github.com/kubernetes/kubernetes/issues/95492

geerlingguy commented 3 years ago

I'm going to run another cluster and get a time series graph of how long it takes per Job:

watch -n5 "kubectl get jobs -l type=50k --field-selector status.successful=1 | wc -l | awk -v date=\", \$(date)\" '{print \$1, date}' >> result.csv"
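
(One caveat with that command: kubectl get prints a header row, so wc -l counts one extra line. Adding --no-headers gives the exact Job count:)

watch -n5 "kubectl get jobs -l type=50k --field-selector status.successful=1 --no-headers | wc -l | awk -v date=\", \$(date)\" '{print \$1, date}' >> result.csv"
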
geerlingguy commented 3 years ago

I also asked Linode via support ticket here: https://cloud.linode.com/support/tickets/14647575 (have to be logged in as me to view ;).

markrity commented 3 years ago

@geerlingguy that's awesome that you're keeping track of this and handing your findings over to the Kubernetes maintainers. It would be great to see a screenshot of what Linode answered to your question (I guess the final conclusion after the investigation, or something).

geerlingguy commented 3 years ago

@markrity - Don't worry, I'll keep things updated here :)

geerlingguy commented 3 years ago

So to sum up some of the things I've learned:

Here are some of the graphs and the raw CSV data used to build them. The data was dumped to CSV with the same command as before:

watch -n5 "kubectl get jobs -l type=50k --field-selector status.successful=1 | wc -l | awk -v date=\", \$(date)\" '{print \$1, date}' >> result.csv"

Linode (defaults)

[graph: jobs-linode-default]

2 GB of RAM limit for Docker for Mac

With scheduler/controller --leader-elect=true

[graph: jobs-local-2gb-true]

With scheduler/controller --leader-elect=false

[graph: jobs-local-2gb-false]

8 GB of RAM limit for Docker for Mac

With scheduler/controller --leader-elect=true

[graph: jobs-local-8gb-true]

With scheduler/controller --leader-elect=false

[graph: jobs-local-8gb-false]

And here's a zip file containing all the CSV files:

[attachment: CSV-data.zip]

geerlingguy commented 3 years ago

The fact that the Linode and 2 GB RAM graphs on my Mac line up so perfectly, with the inflection point around 3,000 Jobs, makes me strongly suspect the Linode master has 2 GB of RAM by default.
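
(For the local Docker for Mac cluster that's easy to confirm from the node's reported capacity, e.g. with something like the command below; Linode's managed control plane isn't exposed as a node, which is why I had to ask support instead.)

kubectl get nodes -o custom-columns=NAME:.metadata.name,MEMORY:.status.capacity.memory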