flightstats / hub

fault tolerant, highly available service for data storage and distribution
http://www.flightstats.com
MIT License

Fix a data loss bug related to restarting nodes and inappropriately deleting state in ZK #1240

Closed. lkemmerer closed this 4 years ago

lkemmerer commented 4 years ago

This fixes a bug where, under certain combinations of restarts and race conditions, all of a webhook's state in Zookeeper was destroyed when that webhook was stopped on a node. This caused some batch channel data loss during planned hub downtime, and possibly at other times as well. It's hard to know.

The fix is to ensure that when a node stops a webhook, the only webhook-related data changed in Zookeeper is the record that the node is running that webhook. All other state should remain: the last-completed info, the items currently being delivered, and the items that have errored out.

While I was researching this bug, I realized that we mix up and misuse some verbs related to webhook actions, in particular "delete," "stop," and "remove." I renamed a bunch of things to make the distinction clear. "Stop" now refers to stopping a webhook from running on a node. "Delete" refers to both deleting the webhook's state info in Zookeeper and actually deleting the webhook configuration from Dynamo, which I think always go hand-in-hand.
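To make the stop/delete distinction concrete, here is a minimal in-memory sketch. It is not the hub's actual code (the hub is a Java service backed by Zookeeper and Dynamo); the class names, field names, and the single-dict layout are all assumptions made for illustration. It only models the intended semantics: "stop" clears the node's claim on the webhook and nothing else, while "delete" drops everything.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class WebhookState:
    # Hypothetical fields mirroring the kinds of state named in the description.
    running_on: Set[str] = field(default_factory=set)  # nodes running the webhook
    last_completed: Optional[str] = None                # last completed item
    in_flight: Set[str] = field(default_factory=set)    # items being delivered
    errors: Set[str] = field(default_factory=set)       # items that errored out


class WebhookStateStore:
    """In-memory stand-in for the shared webhook state (illustration only)."""

    def __init__(self) -> None:
        self._webhooks: Dict[str, WebhookState] = {}

    def start(self, name: str, node: str) -> WebhookState:
        # Reuse existing state if present so delivery resumes where it left off.
        state = self._webhooks.setdefault(name, WebhookState())
        state.running_on.add(node)
        return state

    def stop(self, name: str, node: str) -> None:
        # Stop: only record that this node no longer runs the webhook.
        # Progress, in-flight items, and errors are left untouched.
        state = self._webhooks.get(name)
        if state is not None:
            state.running_on.discard(node)

    def delete(self, name: str) -> None:
        # Delete: remove all state for the webhook. In the real system this
        # would go together with deleting the webhook configuration.
        self._webhooks.pop(name, None)

    def get(self, name: str) -> Optional[WebhookState]:
        return self._webhooks.get(name)
```

In this model, stopping a webhook on one node and starting it on another hands the second node the same `last_completed` value, which is the property the buggy behavior violated.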

Behavior around stopping and deleting should look something like this:

That's a lot of words. Hopefully they make sense to you and future me. Please reach out if you think talking IRL about the changes would be helpful!