cloudfoundry / bpm-release

isolated bosh jobs
Apache License 2.0

bpm 1.2.1 fails to start processes #160

Closed: maxmoehl closed this issue 1 year ago

maxmoehl commented 1 year ago

We are using bpm to deploy HAProxy through haproxy-boshrelease.

At first we observed the error:

[ALERT]    (21) : [haproxy.main()] Cannot raise FD limit to 1000080, limit is 1024.

even though we have set the value appropriately in the bpm config:

$ cat /var/vcap/jobs/haproxy/config/bpm.yml
processes:
  - name: haproxy
    executable: /var/vcap/jobs/haproxy/bin/haproxy_wrapper
    additional_volumes:
      - path: /var/vcap/jobs/haproxy/config/cidrs
        writable: true
      - path: /var/vcap/jobs/haproxy/config/ssl
        writable: true
      - path: /var/vcap/sys/run/haproxy
        writable: true

    unsafe:
      unrestricted_volumes: []

    limits:
      open_files: 1024128
    capabilities:
      - NET_BIND_SERVICE

After manually trying to start the job for troubleshooting with bpm start haproxy, we now see another error:

time="2023-05-11T08:15:48Z" level=error msg="runc run failed: unable to start container process: exec: \"/var/vcap/packages/bpm/bin/tini\": stat /var/vcap/packages/bpm/bin/tini: permission denied"

Any idea what is happening there or how to fix it? The release seemed to only contain a few version bumps.
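
For context, bpm runs each process through runc with tini as the container entrypoint (which is why tini shows up in the error above), and it presumably translates the limits.open_files value into an RLIMIT_NOFILE entry in the OCI spec it generates. A rough sketch of that mapping, using the runtime-spec Go types rather than bpm's actual code:

// Sketch only: shows how an open_files limit is typically expressed in an
// OCI runtime spec; the values below are taken from the bpm.yml above.
package main

import (
    "encoding/json"
    "fmt"

    specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
    proc := specs.Process{
        // In a real bpm container the entrypoint is tini, which then
        // execs the configured executable.
        Args: []string{"/var/vcap/jobs/haproxy/bin/haproxy_wrapper"},
        Rlimits: []specs.POSIXRlimit{
            {Type: "RLIMIT_NOFILE", Hard: 1024128, Soft: 1024128},
        },
    }

    out, _ := json.MarshalIndent(proc, "", "  ")
    fmt.Println(string(out))
}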

lnguyen commented 1 year ago

Hello! Can you try getting into the bpm container on the haproxy VM with bpm shell haproxy and running ulimit -n to see what limit the bpm container has? We were able to run bpm start haproxy successfully. Do you have a manifest we could use as an example to reproduce this?

abg commented 1 year ago

We're seeing what may be a related problem under pxc-release with bpm v1.2.1. This largely affects ubuntu-xenial and ubuntu-bionic stemcells, and less so ubuntu-jammy.

We have CI jobs that validate that pxc-release can handle high connection counts to a managed MySQL database server. In CI, we started seeing those jobs fail today under bpm/1.2.1 after several hundred concurrent connections. Jammy also seems to impose an artificial limit, but its default limit is much higher and is not causing failures for jobs using that stemcell.

So, in a fresh deploy on either xenial or bionic with bpm/1.2.1, we observe a very low nofile limit of 4096 for our "proxy" job that runs under bpm:

# lsb_release -sc
xenial # <= identical on bionic
$ cat /proc/$(pidof proxy)/limits
...
Max open files            4096                 4096                 files
...

Although in our bpm.yml we have:

...
  limits:
    open_files: 1048576
...

If we bosh ssh and manually run bpm stop && bpm start, we see the low limit goes away:

# lsb_release -sc
xenial # <= identical on bionic
# bpm stop proxy && bpm start proxy
# cat /proc/$(pidof proxy)/limits
...
Max open files            1048576              1048576              files
...

However, if we monit restart the process, we observe the limit resets back to 4096:

# lsb_release -sc
xenial # <= identical on bionic
# monit restart proxy
# ...wait a bit...
# cat /proc/$(pidof proxy)/limits
...
Max open files            4096                 4096                 files
...

We do see that the bosh monit process has a nofile hard limit of 4096:

# lsb_release -sc
xenial # <= identical on bionic
# cat /proc/$(pidof monit-actual)/limits
...
Max open files            1024                 4096                 files
...

Under jammy, this limit is higher, but not as high as we configure in our bpm.yml:

# lsb_release -sc
jammy
# cat /proc/$(pidof monit-actual)/limits
...
Max open files            1024                 524288               files
...

This limit also applies to our "proxy" job:

# lsb_release -sc
jammy
# cat /proc/$(pidof proxy)/limits
...
Max open files            524288               524288               files
...

Similarly, if we bpm stop && bpm start, we see this limit goes away (i.e. we get the limit specified in our bpm.yml, 1048576), but under monit restart the limit seems to be capped by monit's hard limit.

Under any stemcell, if we roll back to bpm/1.2.0, this problem appears to go away entirely:

# bpm --version
1.2.0
# cat /proc/$(pidof proxy)/limits
...
Max open files            1048576              1048576              files
...
# monit restart proxy
# cat /proc/$(pidof proxy)/limits
...
Max open files            1048576              1048576              files
...

abg commented 1 year ago

Offhand, it looks like this may be related to https://github.com/golang/go/issues/59064, which was introduced in Go v1.20.4.

Experimentally rebuilding bpm/1.2.1 locally with Go v1.20.3 also makes this issue disappear.
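
If I'm reading the linked issue right, it concerns the Go runtime raising RLIMIT_NOFILE at startup and restoring the original value when exec'ing children, which can leave a child with a lower limit than its parent explicitly configured via Setrlimit. A minimal sketch (not bpm or runc code) that can be built with different Go patch releases to compare the behaviour:

// Raises RLIMIT_NOFILE, then execs a shell that prints its own soft limit.
// Depending on the Go release used to build this, the child may report the
// raised limit or the value the parent started with.
package main

import (
    "fmt"
    "os"
    "os/exec"
    "syscall"
)

func main() {
    // Raise the open-files limit (assumes the hard limit or privileges allow it).
    lim := syscall.Rlimit{Cur: 1048576, Max: 1048576}
    if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
        fmt.Fprintln(os.Stderr, "setrlimit:", err)
        os.Exit(1)
    }

    // Exec a child and print the limit it actually inherited.
    out, err := exec.Command("sh", "-c", "ulimit -n").CombinedOutput()
    if err != nil {
        fmt.Fprintln(os.Stderr, "exec:", err)
        os.Exit(1)
    }
    fmt.Printf("child reports: %s", out)
}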

maxmoehl commented 1 year ago

@lnguyen I think all the information I could have provided is already here. If you still want me to reproduce the issue on our system, please let me know and I will do so.

lnguyen commented 1 year ago

This should be fixed in bpm/1.2.2 @maxmoehl

maxmoehl commented 1 year ago

We confirmed that the issue is fixed with 1.2.2 in our environments. Thank you for the quick response!