cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

Bosh director degradation after deploying manifest with ipv6 IP address #2185

Closed. cwb124 closed this issue 5 years ago.

cwb124 commented 5 years ago

**Describe the bug**
We can successfully deploy manifests that contain both ipv4 and ipv6 addresses. However, after such a deployment the BOSH director becomes unresponsive and times out on subsequent deployments: Ruby eats up 95%+ of the director's memory, swapping becomes excessive, and the director is unable to deploy anything else.

**To Reproduce**
Steps to reproduce the behavior:

  1. Deploy a BOSH director on vSphere.
  2. Upload a trusty or xenial stemcell.
  3. Deploy a manifest containing both ipv4 and ipv6 addresses, with all networks declared in the manifest (see the sketch after this list).
  4. Once the deployment is finished, try to deploy something else. The director begins eating up all memory and becomes mostly unresponsive.
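For illustration, here is a minimal sketch of the kind of dual-stack networks block described in step 3. The network names, address ranges, and vSphere port group names are hypothetical and are not taken from the actual manifest:

```yaml
# Hypothetical example of a manifest that declares both an ipv4 and an ipv6
# network; every name and range below is made up for illustration.
networks:
- name: ipv4
  type: manual
  subnets:
  - range: 10.0.0.0/24
    gateway: 10.0.0.1
    dns: [10.0.0.2]
    cloud_properties:
      name: VM Network        # assumed vSphere port group
- name: ipv6
  type: manual
  subnets:
  - range: "fd7a:eeed:e696:969f:0000:0000:0000:0000/64"   # full-form IPv6, as the BOSH IPv6 docs show
    gateway: "fd7a:eeed:e696:969f:0000:0000:0000:0001"
    dns: ["fd7a:eeed:e696:969f:0000:0000:0000:0002"]
    cloud_properties:
      name: VM Network IPv6   # assumed vSphere port group
```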

**Expected behavior**
I would expect subsequent BOSH deployments to succeed.

**Logs**
Logs are always helpful! Add logs to help explain your problem.

**Versions** (please complete the following information):

**Deployment info:**

```yaml
---
compilation:
  cloud_properties:
    cpu: 4
    disk: 8192
    ram: 8192
  network: ipv4
  reuse_compilation_vms: true
  workers: 1
director_uuid: REDACTED
jobs:
```

**Additional context**
If I ssh to the BOSH director while it is in the degraded state and run `ps -o pid,user,%mem,command ax | sort -b -k3 -r`, I get the following:

`ruby /var/vcap/packages/director/bin/bosh-director-worker -c /var/vcap/jobs/director/config/director.yml -i 3`
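As a side note, here is a hedged sketch of commands one might run in that state to capture the memory picture; these are standard Linux and BOSH-VM tools, not commands taken from this report:

```sh
# Hypothetical diagnostic sketch, assuming a stock BOSH director VM.

# Top memory consumers (the bosh-director-worker shows up here, as above).
ps -o pid,user,%mem,command ax | sort -b -k3 -r | head -20

# Overall memory and swap pressure.
free -m
vmstat 5 3

# State of the director's jobs as monit sees them (monit ships on BOSH VMs).
sudo /var/vcap/bosh/bin/monit summary
```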

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/166301143

The labels on this github issue will be updated when the story is started.

mrosecrance commented 5 years ago

Can you recreate this with the zookeeper release we have in our docs? We haven't been able to reproduce this on vSphere, and it would help to validate that this is an issue with the director and not with your custom release.
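For reference, a minimal repro sketch, assuming the zookeeper release referenced in the bosh.io docs (cppforlife/zookeeper-release), an already-targeted director, and the stemcell and cloud config from the repro steps above; the deployment name and local paths are illustrative:

```sh
# Hypothetical repro sketch using the zookeeper release from the BOSH docs;
# the deployment name and local paths are illustrative.
git clone https://github.com/cppforlife/zookeeper-release
bosh -d zookeeper deploy zookeeper-release/manifests/zookeeper.yml

# Deploy again (or deploy anything else) and watch the director VM's memory,
# as described in the original report.
bosh -d zookeeper deploy zookeeper-release/manifests/zookeeper.yml
```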

jfmyers9 commented 5 years ago

Hi @cwb124,

We are going to close this issue due to inactivity. As mentioned, we were unable to reproduce the symptoms you are describing. If you run into this issue again, could you please open a new issue with some more information about what the worker was doing while in the degraded state? Some things that could be helpful would be:

Let us know if you have any questions.

@jfmyers9 && @xtreme-conor-nosal