cloudfoundry / bosh

Cloud Foundry BOSH is an open source tool chain for release engineering, deployment and lifecycle management of large scale distributed services.
https://bosh.io
Apache License 2.0

Bosh director degradation after deploying manifest with ipv6 IP address #2185

Closed. cwb124 closed this issue 5 years ago.

cwb124 commented 5 years ago

**Describe the bug**
We can successfully deploy manifests that contain both ipv4 and ipv6 addresses. However, after such a deployment the BOSH director becomes unresponsive and times out on subsequent deployments: Ruby eats up 95%+ of the director's memory, swapping becomes excessive, and the director is unable to deploy anything else.

**To Reproduce**
Steps to reproduce the behavior:

  1. Deploy a BOSH director on vSphere.
  2. Upload a trusty or xenial stemcell.
  3. Deploy a manifest containing both ipv4 and ipv6 addresses, with all networks declared in the manifest (see the sketch after this list).
  4. Once the deployment is finished, try to deploy something else. The director begins eating up all memory and becomes mostly unresponsive.
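For illustration, here is a minimal sketch of the kind of dual-stack networks block described in step 3. The network names, address ranges, and vSphere port group names are hypothetical and are not taken from the actual manifest:

```yaml
# Hypothetical example of a manifest that declares both an ipv4 and an ipv6
# network; every name and range below is made up for illustration.
networks:
- name: ipv4
  type: manual
  subnets:
  - range: 10.0.0.0/24
    gateway: 10.0.0.1
    dns: [10.0.0.2]
    cloud_properties:
      name: VM Network        # assumed vSphere port group
- name: ipv6
  type: manual
  subnets:
  - range: "fd7a:eeed:e696:969f:0000:0000:0000:0000/64"   # full-form IPv6, as the BOSH IPv6 docs show
    gateway: "fd7a:eeed:e696:969f:0000:0000:0000:0001"
    dns: ["fd7a:eeed:e696:969f:0000:0000:0000:0002"]
    cloud_properties:
      name: VM Network IPv6   # assumed vSphere port group
```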

**Expected behavior**
I would expect subsequent BOSH deployments to succeed.

**Logs**
Logs are always helpful! Add logs to help explain your problem.

**Versions** (please complete the following information):

**Deployment info:**

```yaml
---
compilation:
  cloud_properties:
    cpu: 4
    disk: 8192
    ram: 8192
  network: ipv4
  reuse_compilation_vms: true
  workers: 1
director_uuid: REDACTED
jobs:
```

**Additional context**
If I ssh to the BOSH director while it is in the degraded state and run `ps -o pid,user,%mem,command ax | sort -b -k3 -r`, I get the following:

`ruby /var/vcap/packages/director/bin/bosh-director-worker -c /var/vcap/jobs/director/config/director.yml -i 3`
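As a side note, here is a hedged sketch of commands one might run in that state to capture the memory picture; these are standard Linux and BOSH-VM tools, not commands taken from this report:

```sh
# Hypothetical diagnostic sketch, assuming a stock BOSH director VM.

# Top memory consumers (the bosh-director-worker shows up here, as above).
ps -o pid,user,%mem,command ax | sort -b -k3 -r | head -20

# Overall memory and swap pressure.
free -m
vmstat 5 3

# State of the director's jobs as monit sees them (monit ships on BOSH VMs).
sudo /var/vcap/bosh/bin/monit summary
```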

cf-gitbot commented 5 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/166301143

The labels on this github issue will be updated when the story is started.

mrosecrance commented 5 years ago

Can you recreate this with the zookeeper release we have in our docs? We haven't been able to reproduce this on vSphere, and it would help to validate that this is an issue with the director and not with your custom release.
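For reference, a minimal repro sketch, assuming the zookeeper release referenced in the bosh.io docs (cppforlife/zookeeper-release), an already-targeted director, and the stemcell and cloud config from the repro steps above; the deployment name and local paths are illustrative:

```sh
# Hypothetical repro sketch using the zookeeper release from the BOSH docs;
# the deployment name and local paths are illustrative.
git clone https://github.com/cppforlife/zookeeper-release
bosh -d zookeeper deploy zookeeper-release/manifests/zookeeper.yml

# Deploy again (or deploy anything else) and watch the director VM's memory,
# as described in the original report.
bosh -d zookeeper deploy zookeeper-release/manifests/zookeeper.yml
```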

jfmyers9 commented 5 years ago

Hi @cwb124,

We are going to close this issue due to inactivity. As mentioned, we were unable to reproduce the symptoms you are describing. If you run into this issue again, could you please open a new issue with some more information about what the worker was doing while in the degraded state? Some things that could be helpful would be:

Let us know if you have any questions.

@jfmyers9 && @xtreme-conor-nosal