coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0

CoreOs cluster restarted all containers due to fleet or etcd errors #1725

Open ghost opened 7 years ago

ghost commented 7 years ago

Hello, we just saw a pretty severe issue on our production CoreOS setup. Details are:

Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:179: Engine leadership acquisition failed: context deadline exceeded
Jan 17 21:59:41 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:168: Starting server components
Jan 17 21:59:42 ip-10-26-31-100.ec2.internal fleetd[999]: INFO engine.go:185: Engine leadership acquired
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(kafka-broker-1.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR reconciler.go:62: Failed resolving task: task={Type: UnscheduleUnit, JobName: kafka-broker-1.service, MachineID: 6ca65ead2f164b2682c0d941c
Jan 17 21:59:44 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(newNewApps.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
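To put a number on the gap in the log above, here is a minimal Python sketch that measures the time between "Establishing etcd connectivity" and the next "Starting server components". Treating that gap as a rough proxy for how long fleetd was waiting on etcd is an interpretation of the log text, not documented fleet behaviour. The function names and the `year=2017` default are assumptions for illustration (the journal's short format omits the year).

```python
import re
from datetime import datetime

# Three lines copied from the excerpt above.
LOG = """\
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
Jan 17 21:59:41 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:168: Starting server components
"""

# timestamp, host, fleetd[pid], level, source location, message
LINE_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ fleetd\[\d+\]: (\w+) \S+: (.*)$"
)


def parse(line, year=2017):
    """Return (timestamp, level, message), or None if the line does not match.

    The year is an assumption: the log lines themselves do not carry one.
    """
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return ts, m.group(2), m.group(3)


def outage_windows(lines):
    """Pair each 'Establishing etcd connectivity' with the next
    'Starting server components' and return (start, end, seconds) tuples."""
    windows = []
    start = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        ts, _level, msg = parsed
        if msg.startswith("Establishing etcd connectivity"):
            if start is None:
                start = ts
        elif msg.startswith("Starting server components") and start is not None:
            windows.append((start, ts, (ts - start).total_seconds()))
            start = None
    return windows


for start, end, seconds in outage_windows(LOG.splitlines()):
    print(f"{start} -> {end}: {seconds:.0f}s")
```

For the excerpt above this reports a single window of 199 seconds (21:56:22 to 21:59:41). Running the same function over the attached fleet logs from all three machines would show whether they lost etcd at the same time.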

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000

/run/systemd/system/etcd.service.d/10-oem.conf

[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200

/run/systemd/system/etcd.service.d/20-cloudinit.conf

[Service]
Environment="ETCD_ADDR=10.26.33.251:4001"
Environment="ETCD_CERT_FILE=/home/etcd/certs/cert.crt"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/"
Environment="ETCD_KEY_FILE=/home/etcd/certs/key.pem"
Environment="ETCD_PEER_ADDR=10.26.33.251:7001"
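Since the etcd settings are spread over the base unit and two drop-ins, it can help to see the combined result. The Python sketch below merges the `Environment=` lines in the order systemd applies them (base unit first, then drop-ins sorted by file name). It is a simplified illustration, not a systemd parser: it ignores section headers and leaves specifiers such as `%m` unexpanded, and `effective_environment` is a hypothetical helper name. On a live host, `systemctl show etcd.service -p Environment` gives the authoritative answer.

```python
import shlex

# Contents copied from the post; the path of the base unit is not shown there.
BASE_UNIT = """\
[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
"""

DROPINS = {
    "10-oem.conf": """\
[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200
""",
    "20-cloudinit.conf": """\
[Service]
Environment="ETCD_ADDR=10.26.33.251:4001"
Environment="ETCD_CERT_FILE=/home/etcd/certs/cert.crt"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/"
Environment="ETCD_KEY_FILE=/home/etcd/certs/key.pem"
Environment="ETCD_PEER_ADDR=10.26.33.251:7001"
""",
}


def effective_environment(base, dropins):
    """Merge Environment= assignments: base unit first, then drop-ins in
    file-name order. Later assignments to the same variable win."""
    env = {}
    for text in [base] + [dropins[name] for name in sorted(dropins)]:
        for raw in text.splitlines():
            line = raw.strip()
            if not line.startswith("Environment="):
                continue
            value = line[len("Environment="):]
            if value == "":
                # An empty assignment resets the list in systemd.
                env.clear()
                continue
            # shlex handles the double-quoted form used by cloud-init.
            for assignment in shlex.split(value):
                key, _, val = assignment.partition("=")
                env[key] = val
    return env


for key, val in sorted(effective_environment(BASE_UNIT, DROPINS).items()):
    print(f"{key}={val}")
```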

etcd-10-26-31-100.txt etcd-10-26-32-94.txt etcd-10-26-33-251.txt fleet-10-26-31-100.txt fleet-10-26-32-94.txt fleet-10-26-33-251.txt

We would appreciate it if someone could take a look at the above and give us pointers on what to investigate and what we can do to mitigate this.

I opened an etcd ticket (https://github.com/coreos/etcd/issues/7177) and was redirected here.

Thx
Maulik

ghost commented 7 years ago

Following up - we have done the below:

Jan 18 01:39:04 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:157: Establishing etcd connectivity
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:168: Starting server components
Jan 18 01:39:49 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:50 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:40:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 18 01:40:05 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:157: Establishing etcd connectivity
Jan 18 01:40:18 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:168: Starting server components
Jan 18 01:40:20 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO engine.go:79: Engine leader is 6ca65ead2f164b2682c0d941c8a75d9b
Jan 18 01:40:34 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint
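The failures above have two different causes: most are "context deadline exceeded", but the last one is an invalid JSON response. A small Python sketch like the one below can tally the `Failed ... Unit(...)` errors by reason when going through the full attached logs. The regular expression is fitted to the lines shown here and may miss other fleet message formats; `tally_failures` is a hypothetical helper name.

```python
import re
from collections import Counter

# Three lines copied from the excerpt above.
LOG = """\
Jan 18 01:39:04 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:40:34 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint
"""

# Captures the action ("creating", "unscheduling"), the unit name,
# and everything after the first ": " that follows the unit as the reason.
ERR_RE = re.compile(r"ERROR \S+: Failed (\w+) Unit\(([^)]+)\).*?: (.+)$")


def tally_failures(lines):
    """Count unit failures grouped by the reason text at the end of the line."""
    reasons = Counter()
    for line in lines:
        m = ERR_RE.search(line)
        if m is not None:
            reasons[m.group(3)] += 1
    return reasons


for reason, count in tally_failures(LOG.splitlines()).most_common():
    print(f"{count:3d}  {reason}")
```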