update of bosh director with bosh-init fails

holgero commented 8 years ago

I tried to update the stemcell (from 3212 to 3232.4) by running bosh-init. The deployment proceeded as usual until near the end, then failed while waiting for the instance to be running: Waiting for instance 'bosh/0' to be running... Failed

However, the BOSH director responded fine to all requests afterwards (bosh status and bosh deployments worked as expected). On the director VM itself, monit summary reported Execution failed for the director, although the process was running, was listening on port 25555, and had the PID recorded in the pid file /var/vcap/sys/run/director/director.pid. The monit log (/var/vcap/monit/monit.log) showed the director marked as failed about 30 seconds after it was started, followed about 5 seconds later by another entry saying it had started successfully.
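The check described above (process alive and matching the pid file, while monit says otherwise) can be sketched as a small shell helper. This is only a sketch: pid_file_is_live is a hypothetical name, not part of BOSH or monit, and the path is the one from this report.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 when the PID recorded in a pid file
# belongs to a live process, non-zero otherwise.
pid_file_is_live() {
  local pid_file="$1" pid
  [ -r "$pid_file" ] || return 1
  pid="$(tr -d '[:space:]' < "$pid_file")"
  case "$pid" in
    ''|*[!0-9]*) return 1 ;;   # empty or not a number
  esac
  # kill -0 sends no signal; it only tests whether the process can be signalled.
  # Run as root (or as the process owner), otherwise a live process may be reported as dead.
  kill -0 "$pid" 2>/dev/null
}

# On the director VM, compare this with what monit reports:
#   monit summary
#   pid_file_is_live /var/vcap/sys/run/director/director.pid && echo "director process is alive"
```

If the helper succeeds while monit summary still shows Execution failed, the process itself is fine and only monit's view of it is stale.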

Here is the deployment manifest I used:

cloud_provider:
  mbus: https://mbus:<redacted>@192.168.1.11:6868
  properties:
    agent:
      mbus: https://mbus:<redacted>@0.0.0.0:6868
    blobstore:
      path: /var/vcap/micro_bosh/data/cache
      provider: local
    ntp:
    - timehost1.<redacted>
    - timehost2.<redacted>
    - timehost3.<redacted>
    openstack:
      api_key: <redacted>
      auth_url: https://<redacted>:5000/v2.0
      connection_options:
        ca_cert: |+
          -----BEGIN CERTIFICATE-----
          <redacted>
          -----END CERTIFICATE-----

      default_key_name: key
      default_security_groups:
      - bosh-default
      human_readable_vm_names: true
      tenant: <redacted>
      username: <redacted>
  ssh_tunnel:
    host: 192.168.1.11
    port: 22
    private_key: ../key.pem
    user: vcap
  template:
    name: openstack_cpi
    release: bosh-openstack-cpi
disk_pools:
- cloud_properties:
    availability_zone: <redacted>
  disk_size: 120000
  name: disks
jobs:
- instances: 1
  name: bosh
  networks:
  - default:
    - dns
    - gateway
    name: private
    static_ips:
    - 192.168.1.11
  persistent_disk_pool: disks
  properties:
    agent:
      mbus: nats://nats:<redacted>@192.168.1.11:4222
    blobstore:
      address: 192.168.1.11
      agent:
        password: <redacted>
        user: agent
      director:
        password: <redacted>
        user: director
      port: 25250
      provider: dav
    director:
      address: 127.0.0.1
      cpi_job: openstack_cpi
      db:
        adapter: postgres
        database: bosh
        host: 127.0.0.1
        password: <redacted>
        user: postgres
      max_tasks: 30000
      max_threads: 30
      name: bosh
      trusted_certs: |+
        -----BEGIN CERTIFICATE-----
        <redacted>
        -----END CERTIFICATE-----
      user_management:
        local:
          users:
          - name: admin
            password: <redacted>
          - name: hm
            password: <redacted>
          - name: concourse
            password: <redacted>
          - name: backup
            password: <redacted>
        provider: local
    dns:
      address: 192.168.1.11
      db:
        adapter: postgres
        database: bosh
        host: 127.0.0.1
        password: <redacted>
        user: postgres
    hm:
      director_account:
        password: <redacted>
        user: hm
      graphite:
        address: 10.1.3.1
        port: 2003
        prefix: CF
      graphite_enabled: true
      resurrector_enabled: true
      syslog_event_forwarder:
        address: 10.1.4.3
      syslog_event_forwarder_enabled: true
    host: 192.168.1.11
    nats:
      address: 127.0.0.1
      password: <redacted>
      user: nats
    ntp:
    - timehost1.<redacted>
    - timehost2.<redacted>
    - timehost3.<redacted>
    openstack:
      api_key: <redacted>
      auth_url: https://<redacted>:5000/v2.0
      connection_options:
        ca_cert: |+
          -----BEGIN CERTIFICATE-----
          <redacted>
          -----END CERTIFICATE-----

      default_key_name: key
      default_security_groups:
      - bosh-default
      human_readable_vm_names: true
      tenant: <redacted>
      username: <redacted>
    postgres:
      adapter: postgres
      database: bosh
      host: 127.0.0.1
      password: <redacted>
      user: postgres
    redis:
      address: 127.0.0.1
      password: <redacted>
    registry:
      address: 192.168.1.11
      db:
        adapter: postgres
        database: bosh
        host: 127.0.0.1
        password: <redacted>
        user: postgres
      endpoint: http://admin:<redacted>@192.168.1.11:25777
      host: 192.168.1.11
      http:
        password: <redacted>
        port: 25777
        user: admin
      password: <redacted>
      port: 25777
      username: admin
  resource_pool: vms
  templates:
  - name: nats
    release: bosh
  - name: redis
    release: bosh
  - name: postgres-9.4
    release: bosh
  - name: blobstore
    release: bosh
  - name: director
    release: bosh
  - name: health_monitor
    release: bosh
  - name: registry
    release: bosh
  - name: powerdns
    release: bosh
  - name: openstack_cpi
    release: bosh-openstack-cpi
name: bosh
networks:
- name: private
  subnets:
  - cloud_properties:
      net_id: <redacted>
      security_groups:
      - bosh
    dns:
    - 172.18.4.23
    - 172.18.4.24
    gateway: 192.168.1.1
    range: 192.168.1.0/24
    static:
    - 192.168.1.11
  type: manual
- name: public
  type: vip
releases:
- name: bosh
  sha1: 6b12652650b87810dcef1be1f6a6d23f1c0c13a7
  url: https://bosh.io/d/github.com/cloudfoundry/bosh?v=255.8
- name: bosh-openstack-cpi
  sha1: 6621ee1326e7136d9dbebacd1158d101618dc719
  url: https://bosh.io/d/github.com/cloudfoundry-incubator/bosh-openstack-cpi-release?v=24
resource_pools:
- cloud_properties:
    availability_zone: <redacted>
    instance_type: medium_4_8
  env:
    bosh:
      password: <redacted>
  name: vms
  network: private
  stemcell:
    sha1: 7f974927463bb44d3580ee16f4a0e8b9fe89202d
    url: https://bosh.io/d/stemcells/bosh-openstack-kvm-ubuntu-trusty-go_agent?v=3232.4

cppforlife commented 8 years ago

@holgero If you run monit reload on the director node, does it put the state back to running? There is potentially a monit problem that leaves it confused about the state.

holgero commented 8 years ago

I didn't try monit reload, but after running sv restart monit, monit did report all jobs (including the director) as running.

voelzmo commented 8 years ago

monit restart all or monit restart director also fixed the problem. However, the question remains: why did the director take more than 30 seconds to start?

cunnie commented 8 years ago

I think I've seen the same behavior installing a BOSH Director via bosh-init to a t2.nano instance on AWS. I can try to replicate.

Update 6/2/2016: I tried deploying several times and was unable to replicate the failure (i.e. bosh-init deploy succeeded every time).

voelzmo commented 8 years ago

Current suspect: creation of certificates in the director_nginx takes too long due to too little entropy: https://github.com/cloudfoundry/bosh/blob/master/release/jobs/director/templates/nginx_ctl#L31-L36

dpb587-pivotal commented 7 years ago

Closing - feel free to reopen if you have more information or tested the entropy theory.

voelzmo commented 7 years ago

Just verified that this has nothing to do with entropy: even with user-provided certificates, the Director job fails. monit reload works; afterwards the Director job is shown as running.

Increasing the VM size didn't help, and neither did decreasing the number of workers. Any further ideas, @cppforlife?

voelzmo commented 7 years ago

After debugging with @cppforlife we've most likely identified the culprit:

The director startup script recursively changes ownership of /var/vcap/store/director, which contains the log, debug log, and result of every task executed on the director.
On long-running directors and on IaaS layers with slow persistent disks, this can take longer than 30 seconds, even though every operation is a no-op when the ownership is already set.
The monit timeout of 30 seconds then hits, and the director job is marked as failed, although it eventually comes up correctly.
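
To check whether a given director is affected, one could time the recursive ownership change by hand and compare it with the 30-second monit timeout. A minimal sketch: measure_seconds and exceeds_timeout are hypothetical helpers, and the exact command (chown -R vcap:vcap) is an assumption based on the vcap:vcap ownership mentioned in this thread.

```shell
#!/bin/bash
# Hypothetical helper: run a command and print how many whole seconds it took.
measure_seconds() {
  local start=$SECONDS
  "$@" >/dev/null 2>&1
  echo $(( SECONDS - start ))
}

# Monit timeout mentioned in this thread.
MONIT_TIMEOUT=30

# Returns 0 when the given duration (in seconds) exceeds the timeout.
exceeds_timeout() {
  [ "$1" -gt "$MONIT_TIMEOUT" ]
}

# On the director VM, as root (command and owner are assumptions):
#   elapsed="$(measure_seconds chown -R vcap:vcap /var/vcap/store/director)"
#   exceeds_timeout "$elapsed" && echo "ownership change took ${elapsed}s, longer than monit allows"
```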

Solution: move directory creation and ownership changes to pre-start, which may take as long as it needs. Other releases, such as consul-release, already do this.
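
Roughly, a pre-start script along these lines would do the slow work before monit starts watching the job. This is a sketch under assumptions, not the actual change made to the BOSH release: prepare_store_dir is a hypothetical name, and the path and vcap:vcap owner are the ones mentioned in this thread.

```shell
#!/bin/bash
# Hypothetical pre-start sketch: create the store directory and fix its
# ownership here, so the monit-supervised start script no longer has to.
prepare_store_dir() {
  local store_dir="$1" owner="$2"
  mkdir -p "$store_dir" || return 1
  # May be slow on large directories, but pre-start is not bound by the monit timeout.
  chown -R "$owner" "$store_dir"
}

# In a director pre-start script this might be called as (assumed values):
#   prepare_store_dir /var/vcap/store/director vcap:vcap
```

With this in place, the start script only needs to launch the process, which keeps it well inside the 30 seconds monit allows.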

voelzmo commented 7 years ago

Story in our backlog to get it fixed: https://www.pivotaltracker.com/story/show/136281459

Kiemes commented 7 years ago

@cppforlife @tylerschultz We just pushed a commit to develop to fix this.

We manually tested this during an update. A file whose ownership we had changed before updating the director was back to vcap:vcap after the update. Since changing ownership prints nothing to stdout, pre-start.stdout.log is empty. Before the change, the operation didn't write anything to director.stdout.log either.

Could you create a new BOSH release from develop or create a hotfix with these changes? After that, the issue can be closed.

cppforlife commented 7 years ago

We'll create a 260.1 today.

dpb587-pivotal commented 7 years ago

Fixed in v260.1.

cloudfoundry / bosh

update of bosh director with bosh-init fails #1276