Here's the design change I'm going to make to cover this bug, in the manage.py `snapshot_task`:

- Check the `LAST_BACKUP` key. (This value will be a timestamp.) Continue only if the difference between that value and the current time is more than the snapshot interval.
- Try to obtain the `BACKUP_LOCK` key in Consul for running the snapshot, with a TTL equal to the `BACKUP_TTL`. Exit on fail. If the node gets the lock:
  - write the `LAST_BACKUP` key in Consul
  - remove the `BACKUP_LOCK` key from Consul

The above was implemented in https://github.com/autopilotpattern/mysql/issues/61
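The actual implementation lives in the Python of manage.py; as a rough sketch of the same flow in terms of Consul's HTTP API (the key names mirror the constants above, while the Consul address, interval, and TTL values are assumptions):

```bash
CONSUL="http://consul:8500"
SNAPSHOT_INTERVAL=86400   # assumed value, seconds
BACKUP_TTL=120            # assumed value, seconds

# 1. Check LAST_BACKUP; continue only if we're past the snapshot interval.
last_backup=$(curl -s "${CONSUL}/v1/kv/LAST_BACKUP?raw")
now=$(date +%s)
if [ -n "${last_backup}" ] && [ $((now - last_backup)) -lt "${SNAPSHOT_INTERVAL}" ]; then
    exit 0
fi

# 2. Try to take BACKUP_LOCK via a session whose TTL matches BACKUP_TTL.
session=$(curl -s -X PUT "${CONSUL}/v1/session/create" \
    -d "{\"Name\": \"snapshot\", \"TTL\": \"${BACKUP_TTL}s\", \"Behavior\": \"delete\"}" \
    | jq -r .ID)
locked=$(curl -s -X PUT "${CONSUL}/v1/kv/BACKUP_LOCK?acquire=${session}" -d "$(hostname)")
[ "${locked}" = "true" ] || exit 1   # another node holds the lock; exit on fail

# ... run the snapshot itself here ...

# 3. Record the new LAST_BACKUP timestamp, then release the lock.
curl -s -X PUT "${CONSUL}/v1/kv/LAST_BACKUP" -d "${now}" > /dev/null
curl -s -X PUT "${CONSUL}/v1/kv/BACKUP_LOCK?release=${session}" > /dev/null
```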
@misterbisson has reported that the `on_change` handlers of replicas appear to be firing spuriously after a long period of operation, and this causes a failover even when the primary appears as though it should be healthy.

This appears to be a bug in the way we're marking the time for the snapshot in Consul. We can reproduce a minimal test case as follows.
We'll stand up a Consul server and a Consul agent container; the agent container is running under ContainerPilot and sends a trivial healthcheck for a service named "mysql" to Consul. The `onChange` handler will ask Consul for the JSON blob associated with the current status of the service, so that we can get it in the logs. Run the targets as follows, binding the minimal ContainerPilot config into the agent container.
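Something along these lines stands up both containers (a sketch only: the image names, file paths, and the `sleep` stand-in for a real workload are assumptions, and for brevity ContainerPilot here talks straight to the dev server rather than to a local agent):

```bash
# Stand up a Consul dev server (the image tag is an assumption).
docker run -d --name consul -p 8500:8500 consul:0.7.0 agent -dev -client=0.0.0.0

# Minimal ContainerPilot config for the agent container: a no-op health check
# for a service named "mysql", plus an onChange handler that just dumps the
# service's health JSON so it shows up in the logs.
cat > containerpilot.json <<'EOF'
{
  "consul": "consul:8500",
  "services": [
    { "name": "mysql", "port": 3306, "health": "/bin/true", "poll": 5, "ttl": 10 }
  ],
  "backends": [
    {
      "name": "mysql",
      "poll": 5,
      "onChange": "curl -s http://consul:8500/v1/health/service/mysql"
    }
  ]
}
EOF

# Run the agent container under ContainerPilot with that config bound in.
# The image name is hypothetical; anything with containerpilot and curl on it
# will do, and sleep stands in for the real workload.
docker run -d --name agent --link consul:consul \
    -v "$(pwd)/containerpilot.json:/etc/containerpilot.json" \
    example/containerpilot-agent \
    containerpilot -config file:///etc/containerpilot.json sleep 86400
```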
We'll then register a new health check with the agent for the backup, and mark it passing once:
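That's two calls to the agent's HTTP check API (shown here against the dev server's agent endpoint; the 10-second TTL is the value used in this repro):

```bash
# Register a TTL check named "backup"; note it isn't bound to any service.
curl -s -X PUT http://localhost:8500/v1/agent/check/register \
    -d '{"ID": "backup", "Name": "backup", "TTL": "10s"}'

# Mark it passing once; with no further updates the TTL will lapse.
curl -s -X PUT http://localhost:8500/v1/agent/check/pass/backup
```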
After 10 seconds, the TTL for "backup" will expire and the `on_change` handler will fire! When this happens we can check the status of the mysql service with `curl -s http://localhost:8500/v1/health/service/mysql | jq .` and see output along the lines of the sketch below.

Unfortunately this isn't a new bug, but splitting the snapshot from the health check seems to have revealed it, particularly as we've started running this blueprint in situations that were a bit closer to real-world use like autopilotpattern/wordpress.
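The response looks roughly like this (abridged and illustrative of the shape only; the point is that the node-level "backup" check rides along with the "mysql" service's own checks):

```bash
curl -s http://localhost:8500/v1/health/service/mysql | jq .
# [
#   {
#     "Node":    { "Node": "...", ... },
#     "Service": { "ID": "mysql", "Service": "mysql", ... },
#     "Checks": [
#       { "CheckID": "serfHealth", "Status": "passing",  "ServiceID": "" },
#       { "CheckID": "mysql",      "Status": "passing",  "ServiceID": "mysql" },
#       { "CheckID": "backup",     "Status": "critical", "ServiceID": "" }
#     ]
#   }
# ]
```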
The root problem is that when we register a check it's not bound to a particular service, so when the check fails the entire node is marked as unhealthy. We can bind the check to a particular "ServiceID", but this means we'd need some kind of "dummy service" for backups. Given the low frequency of this check I'm actually just going to swap the check out for reading the last snapshot time directly from the kv store rather than adding that kind of complexity.
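Concretely, the two options look something like this (a sketch only; the dummy-service name and the use of the `LAST_BACKUP` key are illustrative):

```bash
# Rejected option: register a dummy service and bind the TTL check to it via
# ServiceID, so an expired TTL only marks that service critical, not the node.
curl -s -X PUT http://localhost:8500/v1/agent/service/register \
    -d '{"ID": "mysql-backups", "Name": "mysql-backups"}'
curl -s -X PUT http://localhost:8500/v1/agent/check/register \
    -d '{"ID": "backup", "Name": "backup", "TTL": "10s", "ServiceID": "mysql-backups"}'

# Chosen option: drop the check entirely and just read the last snapshot
# timestamp back out of the KV store, comparing it to the snapshot interval.
last_backup=$(curl -s "http://localhost:8500/v1/kv/LAST_BACKUP?raw")
echo "last snapshot at: ${last_backup:-never}"
```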
In my reproduction above I've also run into what appears to be a ContainerPilot bug that I didn't think was causing missed health checks, but it turns out it is. This is https://github.com/joyent/containerpilot/issues/178#issuecomment-249282259, so I'm going to hop on that ASAP.