concourse / concourse-bosh-release

Concourse BOSH release
Apache License 2.0

Remove systemd process number limit #153

Closed mnitchev closed 3 years ago

mnitchev commented 3 years ago

The bionic stemcell ships systemd v237, which by default limits system cgroups to 4915 processes. This restriction can cause jobs to fail under load, so we are reverting to the xenial default of pids.max = max.
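Where the 4915 figure likely comes from can be sketched with a little arithmetic. This assumes systemd's percentage-based default (DefaultTasksMax=15%) is applied to the kernel's default pid_max of 32768; machines with many CPUs get a larger pid_max, and systemd's exact rounding may differ slightly. The helper name is hypothetical.

```python
def tasks_max_from_percent(percent: int, pid_max: int = 32768) -> int:
    """Approximate an absolute task limit from a percentage-based TasksMax.

    Hypothetical helper: assumes the percentage is taken of the kernel's
    pid_max (default 32768) and rounded down to a whole number of tasks.
    """
    return pid_max * percent // 100


print(tasks_max_from_percent(15))  # 4915
```

With pids.max = max there is no per-cgroup ceiling at all, so only the kernel-wide pid_max still applies.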

We are seeing this in a pipeline where many periodic jobs run at the same time. The limit gets exhausted fairly quickly, causing random jobs to fail. We hit the same problem in Garden a few months ago; you can find more info in this Pivotal Tracker story and this commit in our BOSH release. SSH-ing onto the Concourse worker, we can see that the garden.system cgroup has pids.max set to max, but since it is a child cgroup of concourse.system, whose pids.max is set to the default of 4915, it gets limited too.
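The reason garden.system is capped even though its own pids.max is max is that the pids controller enforces every ancestor's limit as well. A minimal model of that, assuming hypothetical cgroup paths and a plain dict in place of reading the pids.max files on a real worker:

```python
from typing import Dict, Union

# A pids.max value: an integer, or the literal string "max" meaning unlimited.
Limit = Union[int, str]


def effective_pids_limit(path: str, limits: Dict[str, Limit]) -> Union[int, float]:
    """Return the tightest pids.max on `path` and all of its ancestors.

    `limits` maps cgroup paths (hypothetical, e.g. "/concourse.system") to
    their pids.max values; paths missing from the map count as unlimited.
    Returns float("inf") when nothing on the path sets a numeric limit.
    """
    effective: Union[int, float] = float("inf")
    parts = [p for p in path.strip("/").split("/") if p]
    for depth in range(1, len(parts) + 1):
        ancestor = "/" + "/".join(parts[:depth])
        value = limits.get(ancestor, "max")
        if value != "max":
            effective = min(effective, int(value))
    return effective


limits = {"/concourse.system": 4915, "/concourse.system/garden.system": "max"}
print(effective_pids_limit("/concourse.system/garden.system", limits))  # 4915
```

This only gives the ceiling: sibling cgroups under concourse.system also share the parent's budget, so the real headroom for Garden can be lower still.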

gcapizzi commented 3 years ago

Hi @taylorsilva! We see this change is not part of v7.3.2, since you release from release/7.3.x rather than master. Does this mean we should backport our PR to that branch to get it released soon-ish? Thanks!

taylorsilva commented 3 years ago

We're probably gonna release 7.4 soon based on the milestone progress: https://github.com/concourse/concourse/milestone/75

gcapizzi commented 3 years ago

@taylorsilva I don't see this PR in that milestone; do you mean you're going to release out of master?

taylorsilva commented 3 years ago

Sorry, our release process for this repo is not clear. Basically, whenever we're about to make a new release of the Concourse binary, we create a new release/x.x.x branch on the main repo (concourse/concourse) and on all the packaging repos as well. We branch off the latest commit on master for all those repos and create the release from there.

Therefore, new releases for this repo are only made when new Concourse releases are made. This does suck whenever a packaging-level fix is made, since we can't cut a new packaging-only release.

This is not the case for the helm chart. It is on its own release schedule.
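The branching scheme described above explains why the PR missed v7.3.2: a patch release is built from its release branch, not from master. A minimal sketch, assuming the release/<major>.<minor>.x naming seen in this thread (release/7.3.x) and using hypothetical commit IDs and helper names:

```python
from typing import Dict, Set


def release_branch(version: str) -> str:
    """Map a version like "7.3.2" to its assumed branch, "release/7.3.x"."""
    major, minor, _patch = version.split(".")
    return f"release/{major}.{minor}.x"


def ships_in(version: str, commit: str, branches: Dict[str, Set[str]]) -> bool:
    """True if `commit` is on the release branch that `version` is cut from."""
    return commit in branches.get(release_branch(version), set())


# Hypothetical history: "c3" is the PR commit, merged to master after the
# release/7.3.x branch was cut.
branches = {"master": {"a1", "b2", "c3"}, "release/7.3.x": {"a1", "b2"}}
print(ships_in("7.3.2", "c3", branches))  # False

branches["release/7.3.x"].add("c3")  # backport the PR to the release branch
print(ships_in("7.3.2", "c3", branches))  # True
```

Either the fix gets backported to the existing branch, or it ships when the next branch (release/7.4.x) is cut from master.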