coreos / fleet

fleet ties together systemd and etcd into a distributed init system
Apache License 2.0
2.42k stars 302 forks source link

Failed getting response from http://127.0.0.1:4001/: cancelled #1250

Open Vinceveve opened 9 years ago

Vinceveve commented 9 years ago

7 machines cluster, 3 mains etcd, 4 etcd proxies, all running on 4001 and 2379

On all machine I'm getting this

fleetd[24283]: INFO client.go:292: Failed getting response from http://127.0.0.1:4001/: cancelled

Fleet agent is working : reconcile, schedule ...

But fleetctl on many requests send :

Error retrieving list of units from repository: googleapi: Error 503: fleet server unable to communicate with etcd

OR

2015/06/15 14:29:50 WARN fleetctl.go:772: Error retrieving Unit(worker.collector@x3.service) from Registry: Get http://domain-sock/fleet/v1/units/worker.collector%40x3.service?alt=json: forwarding request denied

Etcd is fine on all machines and answers fine when I query it during the request denied.

Any idea ?

tclavier commented 9 years ago

I have same pb and when i put fleet in debug mode i have : https://gist.github.com/tclavier/3aa1cb84a229aedb53ab

And when i put value into key space with etcdctl that work fine.

ghost commented 9 years ago

I was getting a lot of the same messages, increasing the etcd_request_timeout made them go away for me.

bcwaldon commented 9 years ago

@tclavier is it a timeout issue? Are you able to reproduce the error if you raise the etcd_request_timeout in both fleetctl and fleetd?

tclavier commented 9 years ago

For last test i use

Errors are same :

fleet server unable to communicate with etcd
SkyWriter commented 9 years ago

Looks like I've got the same issue on my side. I keep on getting INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled messages, and at some point Fleet randomly restarts units:

06:38:48 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:38:48 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:38:48 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:38:48 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:39:45 host fleetd[23463]: WARN job.go:272: No Unit found in Registry for Job(consul-registrator.service)
06:39:45 host fleetd[23463]: ERROR job.go:109: Failed to parse Unit from etcd: unable to parse Unit in Registry at key /_coreos.com/fleet/job/consul-registrator.service/object
06:39:45 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:39:46 host fleetd[23463]: INFO manager.go:138: Triggered systemd unit consul-registrator.service stop: job=11845869
06:39:46 host fleetd[23463]: INFO manager.go:259: Removing systemd unit consul-registrator.service
06:39:47 host fleetd[23463]: INFO manager.go:182: Instructing systemd to reload units
06:39:52 host fleetd[23463]: INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit job=consul-registrator.service reason="unit loaded but not scheduled here"
06:39:52 host fleetd[23463]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
06:39:52 host fleetd[23463]: WARN reconcile.go:59: AgentReconciler completed reconciliation in 8.692721762s
06:39:52 host fleetd[23463]: INFO manager.go:246: Writing systemd unit consul-registrator.service (618b)
06:39:52 host fleetd[23463]: INFO manager.go:182: Instructing systemd to reload units
06:39:55 host fleetd[23463]: INFO manager.go:127: Triggered systemd unit consul-registrator.service start: job=11846043
06:39:55 host fleetd[23463]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=consul-registrator.service reason="unit scheduled here but not loaded"
06:39:55 host fleetd[23463]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
06:39:55 host fleetd[23463]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=consul-registrator.service reason="unit currently loaded but desired state is launched"
06:40:03 host fleetd[23463]: WARN reconcile.go:59: AgentReconciler completed reconciliation in 6.414672939s
06:40:30 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:40:30 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:40:30 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:40:30 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:40:45 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:40:45 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled
06:40:45 host fleetd[23463]: INFO client.go:292: Failed getting response from http://localhost:2379/: cancelled

Running:

Timeouts for etcd2:

And for fleet:

Happens on just one out of three machines in the cluster.

tadas-subonis commented 9 years ago

Same here.

Sep 03 14:49:06 coreos-web-7 fleetd[601]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Sep 03 14:49:06 coreos-web-7 fleetd[601]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Sep 03 14:50:51 coreos-web-7 fleetd[601]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Sep 03 14:50:51 coreos-web-7 fleetd[601]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Sep 03 14:50:51 coreos-web-7 fleetd[601]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Sep 03 14:51:21 coreos-web-7 fleetd[601]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Sep 03 14:51:21 coreos-web-7 fleetd[601]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled
Sep 03 14:51:21 coreos-web-7 fleetd[601]: INFO client.go:292: Failed getting response from http://localhost:4001/: cancelled

Also excessive CPU and RAM usage. CPU goes from 10%to40% and ram is around 15% out of 2GB

bronger commented 9 years ago

I got rid of the first error message of the original poster by fixing the dependencies in my unit files: I had made them dependent on etcd although etcd2 is running.

I get the second error message, for example,

2015/09/05 13:07:45 WARN fleetctl.go:797: Error retrieving Unit(memcached.service) from Registry: Get http://domain-sock/fleet/v1/units/memcached.service?alt=json: forwarding request denied

only if I try to start a new unit using fleetctl through a tunnel, with the unit file residing on the host system. Executing it within the VM that's running CoreOS is fine.

l15k4 commented 9 years ago

I have the same issue as @bronger on Vagrant... Trying to load services via --tunnel leads to infinite hanging with these warnings :

Error retrieving Unit(es-data@1.service) from Registry: Get http://domain-sock/fleet/v1/units/es-data%401.service?alt=json: forwarding request denied

This is how I start up Vagrant : https://github.com/l15k4/coreos-es/blob/master/bin/start-vagrant.sh#L47

And then I'm just loading and starting services :

https://github.com/l15k4/coreos-es/blob/master/bin/start-vagrant.sh#L91

Using fleetctl --debug=true --strict-host-key-checking=false.

bweston92 commented 8 years ago

@bronger any update as to why you got that error? I'm getting a similar one too.

$ fleetctl start app-dashboard@{1..4}
2016/01/21 10:29:12 WARN fleetctl.go:799: Error retrieving Unit(app-dashboard@1.service) from Registry: Get http://domain-sock/fleet/v1/units/app-dashboard%401.service?alt=json: forwarding request denied
2016/01/21 10:29:13 WARN fleetctl.go:799: Error retrieving Unit(app-dashboard@2.service) from Registry: Get http://domain-sock/fleet/v1/units/app-dashboard%402.service?alt=json: forwarding request denied
2016/01/21 10:29:13 WARN fleetctl.go:799: Error retrieving Unit(app-dashboard@1.service) from Registry: Get http://domain-sock/fleet/v1/units/app-dashboard%401.service?alt=json: forwarding request denied
2016/01/21 10:29:13 WARN fleetctl.go:799: Error retrieving Unit(app-dashboard@3.service) from Registry: Get http://domain-sock/fleet/v1/units/app-dashboard%403.service?alt=json: forwarding request denied
Unit app-dashboard@4.service inactive
Unit app-dashboard@2.service inactive
Unit app-dashboard@1.service inactive
Unit app-dashboard@3.service inactive
2016/01/21 10:29:14 WARN fleetctl.go:799: Error retrieving Unit(app-dashboard@2.service) from Registry: Get http://domain-sock/fleet/v1/units/app-dashboard%402.service?alt=json: forwarding request denied
2016/01/21 10:29:14 WARN fleetctl.go:799: Error retrieving Unit(app-dashboard@3.service) from Registry: Get http://domain-sock/fleet/v1/units/app-dashboard%403.service?alt=json: forwarding request denied
2016/01/21 10:29:14 WARN fleetctl.go:799: Error retrieving Unit(app-dashboard@4.service) from Registry: Get http://domain-sock/fleet/v1/units/app-dashboard%404.service?alt=json: forwarding request denied

$ fleetctl -version
fleetctl version 0.11.5
bronger commented 8 years ago

I abandoned fleet for my workflow, so I cannot help on this.

brandonkruse commented 8 years ago

@bweston92 We handled this by changing the timeout settings. See issue 1289 and specifically this comment: https://github.com/coreos/fleet/issues/1289#issuecomment-146424680