Open jeanfabrice opened 8 years ago
To me there seems to be a race condition when fleet starts after a reboot: it appears to try to start services before all the unit files are loaded. Here are the relevant parts of the logs from @jeanfabrice, following grafana.service and its sidekick grafana-announce.service:
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit grafana.service (1061b)
Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:182: Instructing systemd to reload units
Jun 29 06:24:24 coreos41 fleetd[3547]: ERROR manager.go:129: Failed to trigger systemd unit grafana.service start: Unit grafana-announce.service not found.
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=grafana.service reason="unit scheduled here but not loaded"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
Jun 29 06:24:25 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit job=grafana.service reason="unit currently loaded but desired state is launched"
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit grafana-announce.service (386b)
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO manager.go:182: Instructing systemd to reload units
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit job=grafana-announce.service reason="unit scheduled here but not loaded"
Jun 29 06:24:26 coreos41 fleetd[3547]: INFO reconcile.go:330: AgentReconciler completed task: type=ReloadUnitFiles job=N/A reason="always reload unit files"
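For anyone who wants to check their own fleetd journal for the same ordering, here is a minimal Python sketch. It assumes the two message formats quoted in the log above ("Writing systemd unit ..." and "Failed to trigger systemd unit ... start: Unit ... not found"); other fleet versions may word them differently. The function name `find_start_before_write` is mine, not part of fleet.

```python
import re

# Patterns for the two fleetd messages quoted in the log above.
# The wording is an assumption taken from that log only.
WRITE_RE = re.compile(r"Writing systemd unit (\S+)")
FAIL_RE = re.compile(
    r"Failed to trigger systemd unit (\S+) start: Unit (\S+) not found"
)


def find_start_before_write(lines):
    """Return (unit, dependency) pairs where starting `unit` failed
    because `dependency` was missing, and `dependency` was only
    written to disk later in the same log."""
    written = set()   # unit files fleetd has written so far
    pending = []      # failed starts whose dependency is not written yet
    races = []
    for line in lines:
        failed = FAIL_RE.search(line)
        if failed:
            unit, dep = failed.group(1), failed.group(2)
            if dep not in written:
                pending.append((unit, dep))
            continue
        wrote = WRITE_RE.search(line)
        if wrote:
            name = wrote.group(1)
            written.add(name)
            # Any pending failure waiting on this unit is a confirmed race.
            races.extend(p for p in pending if p[1] == name)
            pending = [p for p in pending if p[1] != name]
    return races


# Three lines copied from the log above.
LOG = [
    "Jun 29 06:24:24 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit grafana.service (1061b)",
    "Jun 29 06:24:24 coreos41 fleetd[3547]: ERROR manager.go:129: Failed to trigger systemd unit grafana.service start: Unit grafana-announce.service not found.",
    "Jun 29 06:24:26 coreos41 fleetd[3547]: INFO manager.go:246: Writing systemd unit grafana-announce.service (386b)",
]

if __name__ == "__main__":
    for unit, dep in find_start_before_write(LOG):
        print(f"{unit} was started before {dep} was written")
```

Feeding it the full journal output line by line should flag every unit whose start was attempted before its sidekick unit file existed.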
I see the same behavior with sidekick services where both services are needed. I run the latest stable CoreOS (1122.3.0), which ships with fleet 0.11.7. I followed the guide at https://www.digitalocean.com/community/tutorials/how-to-create-flexible-services-for-a-coreos-cluster-with-fleet-unit-files to set up the sidekick services (with the variable names in the X-Fleet sections corrected).
@jeanfabrice have you started or just loaded grafana-announce.service with fleet, i.e. is the DSTATE for grafana-announce.service loaded or launched when you run fleetctl list-unit-files? To me it seems possible to circumvent this problem by also telling fleet to start the sidekick and not just load it.
@cskarby Thanks for the heads up. Apart from the workaround, I think fleet should be able to avoid such races. I'll try to look into that.
First we need to check whether it's already fixed. There was a similar issue, https://github.com/coreos/fleet/issues/998, which was fixed by https://github.com/coreos/fleet/pull/1647. That fix is already merged and is going to be included in the future release v1.0. I would like to know whether this issue is resolved, or at least happens less frequently, when testing with the current master branch.
If the issue is still there with the master branch, we need to think about other possibilities.
Hi,
I'm trying to have my containers survive a node failure by being rescheduled and restarted on the surviving nodes of a 3-node CoreOS beta channel cluster (1068.3.0).
I'm facing the following issue when shutting down a member node: some containers get restarted, seemingly at random, and others don't. According to the fleet log, the root cause seems to be that the corresponding sidekick unit is not available on disk when fleet decides to start a service unit.
Here is a unit and its sidekick counterpart (I'm using -announce as the suffix for the sidekick unit name):
And here is the fleet log:
I have read many issues about similar behaviour, but all are now closed and seem to be related to fleet v0.9.1. I'm running the following:
Is there some misconfiguration in my units?