Here's the design change I'm going to make to cover this bug, in the manage.py `snapshot_task`:

- Check the `LAST_BACKUP` key. (This value will be a timestamp.) Continue only if the difference between that value and the current time is more than the snapshot interval.
- Try to obtain the `BACKUP_LOCK` key in Consul for running the snapshot, with a TTL equal to the `BACKUP_TTL`. Exit on fail. If the node gets the lock:
  - write the `LAST_BACKUP` key in Consul
  - remove the `BACKUP_LOCK` key from Consul

The above was implemented in https://github.com/autopilotpattern/mysql/issues/61
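The actual implementation lives in the Python of manage.py; as a rough sketch of the same flow in terms of Consul's HTTP API (the key names mirror the constants above, while the Consul address, interval, and TTL values are assumptions):

```bash
CONSUL="http://consul:8500"
SNAPSHOT_INTERVAL=86400   # assumed value, seconds
BACKUP_TTL=120            # assumed value, seconds

# 1. Check LAST_BACKUP; continue only if we're past the snapshot interval.
last_backup=$(curl -s "${CONSUL}/v1/kv/LAST_BACKUP?raw")
now=$(date +%s)
if [ -n "${last_backup}" ] && [ $((now - last_backup)) -lt "${SNAPSHOT_INTERVAL}" ]; then
    exit 0
fi

# 2. Try to take BACKUP_LOCK via a session whose TTL matches BACKUP_TTL.
session=$(curl -s -X PUT "${CONSUL}/v1/session/create" \
    -d "{\"Name\": \"snapshot\", \"TTL\": \"${BACKUP_TTL}s\", \"Behavior\": \"delete\"}" \
    | jq -r .ID)
locked=$(curl -s -X PUT "${CONSUL}/v1/kv/BACKUP_LOCK?acquire=${session}" -d "$(hostname)")
[ "${locked}" = "true" ] || exit 1   # another node holds the lock; exit on fail

# ... run the snapshot itself here ...

# 3. Record the new LAST_BACKUP timestamp, then release the lock.
curl -s -X PUT "${CONSUL}/v1/kv/LAST_BACKUP" -d "${now}" > /dev/null
curl -s -X PUT "${CONSUL}/v1/kv/BACKUP_LOCK?release=${session}" > /dev/null
```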
@misterbisson has reported that the `on_change` handlers of replicas appear to be firing spuriously after a long period of operation, and this causes a failover even when the primary appears as though it should be healthy.

This appears to be a bug in the way we're marking the time for the snapshot in Consul. We can reproduce a minimal test case as follows.
We'll stand up a Consul server and a Consul agent container; the agent container is running under ContainerPilot and sends a trivial healthcheck for a service named "mysql" to Consul. The `onChange` handler will ask Consul for the JSON blob associated with the current status of the service, so that we can get it in the logs. Run the targets as follows, binding the minimal ContainerPilot config into the agent container.
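Something along these lines stands up both containers (a sketch only: the image names, file paths, and the `sleep` stand-in for a real workload are assumptions, and for brevity ContainerPilot here talks straight to the dev server rather than to a local agent):

```bash
# Stand up a Consul dev server (the image tag is an assumption).
docker run -d --name consul -p 8500:8500 consul:0.7.0 agent -dev -client=0.0.0.0

# Minimal ContainerPilot config for the agent container: a no-op health check
# for a service named "mysql", plus an onChange handler that just dumps the
# service's health JSON so it shows up in the logs.
cat > containerpilot.json <<'EOF'
{
  "consul": "consul:8500",
  "services": [
    { "name": "mysql", "port": 3306, "health": "/bin/true", "poll": 5, "ttl": 10 }
  ],
  "backends": [
    {
      "name": "mysql",
      "poll": 5,
      "onChange": "curl -s http://consul:8500/v1/health/service/mysql"
    }
  ]
}
EOF

# Run the agent container under ContainerPilot with that config bound in.
# The image name is hypothetical; anything with containerpilot and curl on it
# will do, and sleep stands in for the real workload.
docker run -d --name agent --link consul:consul \
    -v "$(pwd)/containerpilot.json:/etc/containerpilot.json" \
    example/containerpilot-agent \
    containerpilot -config file:///etc/containerpilot.json sleep 86400
```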
We'll then register a new health check with the agent for the backup, and mark it passing once:
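That's two calls to the agent's HTTP check API (shown here against the dev server's agent endpoint; the 10-second TTL is the value used in this repro):

```bash
# Register a TTL check named "backup"; note it isn't bound to any service.
curl -s -X PUT http://localhost:8500/v1/agent/check/register \
    -d '{"ID": "backup", "Name": "backup", "TTL": "10s"}'

# Mark it passing once; with no further updates the TTL will lapse.
curl -s -X PUT http://localhost:8500/v1/agent/check/pass/backup
```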
After 10 seconds, the TTL for "backup" will expire and the `on_change` handler will fire! When this happens we can check the status of the mysql service with `curl -s http://localhost:8500/v1/health/service/mysql | jq .` and see output along the lines of the sketch below.

Unfortunately this isn't a new bug, but splitting the snapshot from the health check seems to have revealed it, particularly as we've started running this blueprint in situations that were a bit closer to real-world use like autopilotpattern/wordpress.
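The response looks roughly like this (abridged and illustrative of the shape only; the point is that the node-level "backup" check rides along with the "mysql" service's own checks):

```bash
curl -s http://localhost:8500/v1/health/service/mysql | jq .
# [
#   {
#     "Node":    { "Node": "...", ... },
#     "Service": { "ID": "mysql", "Service": "mysql", ... },
#     "Checks": [
#       { "CheckID": "serfHealth", "Status": "passing",  "ServiceID": "" },
#       { "CheckID": "mysql",      "Status": "passing",  "ServiceID": "mysql" },
#       { "CheckID": "backup",     "Status": "critical", "ServiceID": "" }
#     ]
#   }
# ]
```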
The root problem is that when we register a check it's not bound to a particular service, so when the check fails the entire node is marked as unhealthy. We can bind the check to a particular "ServiceID", but this means we'd need some kind of "dummy service" for backups. Given the low frequency of this check I'm actually just going to swap the check out for reading the last snapshot time directly from the kv store rather than adding that kind of complexity.
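Concretely, the two options look something like this (a sketch only; the dummy-service name and the use of the `LAST_BACKUP` key are illustrative):

```bash
# Rejected option: register a dummy service and bind the TTL check to it via
# ServiceID, so an expired TTL only marks that service critical, not the node.
curl -s -X PUT http://localhost:8500/v1/agent/service/register \
    -d '{"ID": "mysql-backups", "Name": "mysql-backups"}'
curl -s -X PUT http://localhost:8500/v1/agent/check/register \
    -d '{"ID": "backup", "Name": "backup", "TTL": "10s", "ServiceID": "mysql-backups"}'

# Chosen option: drop the check entirely and just read the last snapshot
# timestamp back out of the KV store, comparing it to the snapshot interval.
last_backup=$(curl -s "http://localhost:8500/v1/kv/LAST_BACKUP?raw")
echo "last snapshot at: ${last_backup:-never}"
```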
In my reproduction above I've also run into what appears to be a ContainerPilot bug that I didn't think was causing missed health checks, but it turns out it is. This is https://github.com/joyent/containerpilot/issues/178#issuecomment-249282259, so I'm going to hop on that ASAP.