coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0
2.42k stars 303 forks source link

Random failure to submit units when etcd endpoint not specified #1650

Open waddles opened 7 years ago

waddles commented 7 years ago

I have built a new environment with a cluster of 3 etcd servers and a build server running etcd in proxy mode:

root@tree:~# /usr/local/sbin/etcd -version
etcd Version: 2.3.3
Git SHA: c41345d
Go Version: go1.6.2
Go OS/Arch: linux/amd64
root@tree:~# /usr/local/sbin/etcd --proxy on \
 -initial-cluster elf=http://10.1.1.3:2380,dwarf=http://10.1.1.4:2380,halfling=http://10.1.1.5:2380 \
 -listen-client-urls http://172.17.42.1:2379,http://127.0.0.1:2379

In the cluster, starting with a clean registry, I run etcd (tried several versions including 2.3.3) using relevant options on all 3 servers:

root@elf:~# rm -fr /var/lib/etcd2/
root@elf:~# /tmp/etcd-2.3.6/etcd -initial-cluster-token mytoken \
 -initial-cluster elf=http://10.1.1.3:2380,dwarf=http://10.1.1.4:2380,halfling=http://10.1.1.5:2380 \
 -initial-cluster-state new -initial-advertise-peer-urls http://10.1.1.3:2380 \
 -advertise-client-urls http://10.1.1.3:2379 \
 -listen-client-urls http://127.0.0.1:2379,http://10.1.1.3:2379 \
 -listen-peer-urls http://127.0.0.1:2380,http://10.1.1.3:2380 \
 -data-dir /var/lib/etcd2 -name elf \
 -heartbeat-interval 600 -election-timeout 6000 -debug > /tmp/etcd.log 2>&1

And wait a few seconds for it to become healthy:

# etcdctl --debug --endpoints=http://10.1.1.3:2379 cluster-health
Cluster-Endpoints: http://10.1.1.3:2379
cURL Command: curl -X GET http://10.1.1.3:2379/v2/members
member b5fa9b282334905d is healthy: got healthy result from http://10.1.1.4:2379
member bdd359f77713beb4 is healthy: got healthy result from http://10.1.1.3:2379
member d45ce0c430449c65 is healthy: got healthy result from http://10.1.1.5:2379
cluster is healthy

Then I try to submit a unit from my build server using fleetctl:

root@tree:/tmp/fleet-v0.11.7-linux-amd64# while :; do 
 ./fleetctl --debug submit lb_feedcache_1002@.service && sleep 1;
 ./fleetctl --endpoint=http://10.1.1.3:2379 --driver=etcd --debug destroy lb_feedcache_1002@.service; sleep 1;
 done
2016/07/26 12:45:05 DEBUG http.go:28: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:05 DEBUG http.go:31: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 404 Not Found
2016/07/26 12:45:05 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:45:05 DEBUG http.go:28: HTTP PUT http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:05 DEBUG http.go:31: HTTP PUT http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 201 Created
2016/07/26 12:45:05 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
2016/07/26 12:45:05 DEBUG http.go:28: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:05 DEBUG http.go:31: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 200 OK
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:45:07 DEBUG http.go:28: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:07 DEBUG http.go:31: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 404 Not Found
2016/07/26 12:45:07 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:45:07 DEBUG http.go:28: HTTP PUT http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:07 DEBUG http.go:31: HTTP PUT http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 201 Created
2016/07/26 12:45:07 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
2016/07/26 12:45:07 DEBUG http.go:28: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:07 DEBUG http.go:31: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 500 Internal Server Error
2016/07/26 12:45:07 WARN fleetctl.go:799: Error retrieving Unit(lb_feedcache_1002@.service) from Registry: googleapi: got HTTP response code 500 with body: {"error":{"code":500,"message":""}}
2016/07/26 12:45:08 DEBUG http.go:28: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:08 DEBUG http.go:31: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 500 Internal Server Error
2016/07/26 12:45:08 WARN fleetctl.go:799: Error retrieving Unit(lb_feedcache_1002@.service) from Registry: googleapi: got HTTP response code 500 with body: {"error":{"code":500,"message":""}}
2016/07/26 12:45:08 DEBUG http.go:28: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:08 DEBUG http.go:31: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 500 Internal Server Error
2016/07/26 12:45:08 WARN fleetctl.go:799: Error retrieving Unit(lb_feedcache_1002@.service) from Registry: googleapi: got HTTP response code 500 with body: {"error":{"code":500,"message":""}}
2016/07/26 12:45:09 DEBUG http.go:28: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:09 DEBUG http.go:31: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 500 Internal Server Error
2016/07/26 12:45:09 WARN fleetctl.go:799: Error retrieving Unit(lb_feedcache_1002@.service) from Registry: googleapi: got HTTP response code 500 with body: {"error":{"code":500,"message":""}}
2016/07/26 12:45:09 DEBUG http.go:28: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json
2016/07/26 12:45:10 DEBUG http.go:31: HTTP GET http://domain-sock/fleet/v1/units/lb_feedcache_1002%40.service?alt=json 500 Internal Server Error
2016/07/26 12:45:10 WARN fleetctl.go:799: Error retrieving Unit(lb_feedcache_1002@.service) from Registry: googleapi: got HTTP response code 500 with body: {"error":{"code":500,"message":""}}
^C^C

As above, it first accepted the unit then on 2nd try, it went into a loop reporting response code 500 from the cluster, yet the keys were correctly created in etcd. See https://gist.github.com/waddles/0e121d46c0499eaaef9685eef06120f0 for more info.

HOWEVER

I see quite a different output and no errors when I specify an endpoint in the cluster to send to (otherwise it is the same command):

root@tree:/tmp/fleet-v0.11.7-linux-amd64# while :; do
 ./fleetctl --endpoint=http://10.1.1.4:2379 --debug submit lb_feedcache_1002@.service && sleep 1;
 ./fleetctl --endpoint=http://10.1.1.3:2379 --driver=etcd --debug destroy lb_feedcache_1002@.service; sleep 1;
 done
2016/07/26 12:51:29 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:29 DEBUG fleetctl.go:612: Found Unit(lb_feedcache_1002@.service) in Registry, no need to recreate it
2016/07/26 12:51:29 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:31 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:31 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:31 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:33 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:33 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:33 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:35 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:35 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:35 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:37 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:37 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:37 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:39 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:39 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:39 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:41 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:41 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:41 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:43 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:43 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:43 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:45 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:45 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:45 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
2016/07/26 12:51:47 DEBUG fleetctl.go:274: Defaulting to --driver=etcd as --endpoint appears to be etcd
2016/07/26 12:51:47 DEBUG fleetctl.go:494: Unit(lb_feedcache_1002@.service) found in local filesystem
2016/07/26 12:51:47 DEBUG fleetctl.go:583: Created Unit(lb_feedcache_1002@.service) in Registry
Unit lb_feedcache_1002@.service inactive
Destroyed lb_feedcache_1002@.service
^C
  1. I never seem to get a failure when the endpoint is specified in the submit command.
  2. I can specify --endpoint=http://127.0.0.1:2379 to force it to use the local etcd proxy and that also succeds reliably.
  3. We also have a similar setup using older versions and legacy ports 4001 and 7001 which does not exhibit this behaviour.
waddles commented 7 years ago

Until we find a fix, I have made a workaround script in /usr/local/bin/fleetctl

#!/bin/bash
/usr/bin/fleetctl --driver=etcd --endpoints=http://127.0.0.1:2379 "$@"

In testing that though, I dropped the -heartbeat-interval 600 -election-timeout 6000 options from the etcd cluster and one of the peers reported a clock sync error > 1s so I resync-ed the clocks with ntp. Only 1 failed submission out of several hundred since then so clock sync may play a role in it too.