cloudfoundry / cf-deployment

The canonical open source deployment manifest for Cloud Foundry
Apache License 2.0

cc_uploader on cc_bridge failed to start due to port binding issue #162

Closed bstick12 closed 7 years ago

bstick12 commented 7 years ago

We were deploying CF on bosh-lite following the instructions in the README, and the deployment failed with this error:

10:31:26 | Updating instance diego-cell: diego-cell/14c93c79-ecd8-4bc1-b6bf-db894ea00207 (0) (canary) (00:03:15)
10:48:57 | Updating instance cc-bridge: cc-bridge/5be970c1-0d63-4d60-bb8c-b62b5ce4726e (0) (canary) (00:20:46)
            L Error: 'cc-bridge/0 (5be970c1-0d63-4d60-bb8c-b62b5ce4726e)' is not running after update. Review logs for failed jobs: cc_uploader

10:48:57 | Error: 'cc-bridge/0 (5be970c1-0d63-4d60-bb8c-b62b5ce4726e)' is not running after update. Review logs for failed jobs: cc_uploader

We ssh'ed into the VM and found this repeated several times in /var/vcap/sys/log/cc_uploader/cc_uploader.stdout.log:

{"timestamp":"1497265361.568850994","source":"cc-uploader","message":"cc-uploader.ready","log_level":1,"data":{}}
{"timestamp":"1497265361.568930149","source":"cc-uploader","message":"cc-uploader.exited-with-failure","log_level":2,"data":{"error":"Exit trace for group:\ncc-uploader exited with error: listen tcp 0.0.0.0:9090: bind: address already in use\ndebug-server exited with nil\n"}}

The metron process was already bound to :::9090.

We ran the deployment again and it succeeded. We were curious why it had failed, so we investigated further and found that the following configuration might be responsible for the failure:

The cc_uploader specifies port 9090 as its listener address, and metron defines port 9090 as its health endpoint port.

It looks like there is a race condition over which process grabs the port first.
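The failure mode can be reproduced in isolation. The following is a minimal Python sketch, not taken from either component: the helper name `try_bind` is hypothetical, and an OS-assigned ephemeral port stands in for 9090. Whichever socket binds first keeps the port, and the second bind fails with the same "address already in use" error seen in the cc_uploader log.

```python
import errno
import socket


def try_bind(port, host="0.0.0.0"):
    """Bind a listening TCP socket; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


# The first process to start wins the port (port 0 asks the OS for a free one).
first, first_err = try_bind(0)
port = first.getsockname()[1]

# A second process asking for the same port loses, as cc_uploader did.
second, second_err = try_bind(port)
print(second_err == errno.EADDRINUSE)

first.close()
```

Because the outcome depends only on start order, a retry can succeed without any configuration change, which would match the second deployment going through.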

Here are the config files:

metron_agent.json

{
  "Index": "5be970c1-0d63-4d60-bb8c-b62b5ce4726e",
  "Job": "cc-bridge",
  "Zone": "z1",
  "Deployment": "bosh-lite.com",
  "IP": "10.244.0.140",
  "Tags": {
    "deployment": "bosh-lite.com",
    "job": "cc-bridge",
    "index": "5be970c1-0d63-4d60-bb8c-b62b5ce4726e",
    "ip": "10.244.0.140"
  },
  "IncomingUDPPort": 3457,
  "DisableUDP": false,
  "PPROFPort": 0,
  "HealthEndpointPort": 9090,
  "GRPC": {
    "Port": 3458,
    "KeyFile": "/var/vcap/jobs/metron_agent/config/certs/metron_agent.key",
    "CertFile": "/var/vcap/jobs/metron_agent/config/certs/metron_agent.crt",
    "CAFile": "/var/vcap/jobs/metron_agent/config/certs/loggregator_ca.crt"
  },
  "DopplerAddr": "doppler.service.cf.internal:8082",
  "DopplerAddrUDP": "doppler.service.cf.internal:3457"
}

cc_uploader_config.json

{
    "cc_ca_cert": "/var/vcap/jobs/cc_uploader/config/certs/cc/ca.crt",
    "cc_client_cert": "/var/vcap/jobs/cc_uploader/config/certs/cc/client.crt",
    "cc_client_key": "/var/vcap/jobs/cc_uploader/config/certs/cc/client.key",
    "consul_cluster": "http://127.0.0.1:8500",
    "debug_server_config": {
        "debug_address": "127.0.0.1:17018"
    },
    "dropsonde_port": 3457,
    "lager_config": {
        "log_level": "info"
    },
    "listen_addr": "0.0.0.0:9090",
    "log_level": "info",
    "mutual_tls": {
        "ca_cert": "/var/vcap/jobs/cc_uploader/config/certs/cc_uploader/ca.crt",
        "listen_addr": "0.0.0.0:9091",
        "server_cert": "/var/vcap/jobs/cc_uploader/config/certs/cc_uploader/server.crt",
        "server_key": "/var/vcap/jobs/cc_uploader/config/certs/cc_uploader/server.key"
    }
}
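A collision like this can be caught before deploying by comparing the rendered configs. The sketch below is only an illustration under stated assumptions: `find_port_collisions` and `port_of` are hypothetical helpers, only the TCP listener settings visible in the two files above are checked, and ports are compared without regard to the bind host (a conservative simplification).

```python
import json


def port_of(addr):
    """Return the port from a 'host:port' string."""
    return int(addr.rsplit(":", 1)[1])


def find_port_collisions(metron_cfg, uploader_cfg):
    """Map each port claimed by more than one setting to those settings."""
    claims = {}

    def claim(port, setting):
        claims.setdefault(port, []).append(setting)

    # TCP listeners only; IncomingUDPPort is UDP and is deliberately skipped.
    claim(metron_cfg["HealthEndpointPort"], "metron_agent.HealthEndpointPort")
    claim(metron_cfg["GRPC"]["Port"], "metron_agent.GRPC.Port")
    claim(port_of(uploader_cfg["listen_addr"]), "cc_uploader.listen_addr")
    claim(port_of(uploader_cfg["mutual_tls"]["listen_addr"]),
          "cc_uploader.mutual_tls.listen_addr")
    claim(port_of(uploader_cfg["debug_server_config"]["debug_address"]),
          "cc_uploader.debug_server_config.debug_address")
    return {port: names for port, names in claims.items() if len(names) > 1}


# Trimmed copies of the two configs above, keeping only the port settings.
metron = json.loads('{"HealthEndpointPort": 9090, "GRPC": {"Port": 3458}}')
uploader = json.loads("""{
  "listen_addr": "0.0.0.0:9090",
  "debug_server_config": {"debug_address": "127.0.0.1:17018"},
  "mutual_tls": {"listen_addr": "0.0.0.0:9091"}
}""")

collisions = find_port_collisions(metron, uploader)
print(collisions)
# {9090: ['metron_agent.HealthEndpointPort', 'cc_uploader.listen_addr']}
```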

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/147015109

The labels on this github issue will be updated when the story is started.

dsabeti commented 7 years ago

@bstick12, thanks for the very detailed explanation of the issue you're seeing.

tl;dr: you probably just need to upgrade to the latest cf-deployment/loggregator.

It looks like there have been several commits in the last month to add the health check port, and then to prevent it from colliding with other components:

The latest version should be using port 22222, so let me know if that doesn't work for you.

bstick12 commented 7 years ago

@dsabeti The latest release looks to be 89, which was 20 days ago. The commits you linked to are after that. We will retest on the next release. For the moment we have overridden the metron_agent.health_port in our manifest and that is working.

Thanks for the quick reply.