flightstats / hub

fault tolerant, highly available service for data storage and distribution
http://www.flightstats.com
MIT License

Stop periodically looking for empty webhook leader zk nodes, just delete them when we delete webhooks #1142

Closed: lkemmerer closed this 5 years ago

lkemmerer commented 5 years ago

As part of the ZooKeeper cleanup, I created a job that periodically goes through and finds webhook leader nodes without locks and leases. It executed the same code that had originally run only when a node started up.

It turned out the code doesn't do what I thought it did. From the logs, it looks like the lifecycle of ZK webhook leader locks means an active webhook sometimes meets the deletion criteria in the existing code. When that happened, its leader node was deleted (without releasing leadership) and then recreated shortly afterwards, occasionally causing two nodes to think they were leaders.

So I reverted the periodic cleanup, and the strange code goes back to executing only on startup (commit 1). In commit 2, I solve the same problem in a (hopefully) safer way: deleting the webhook leader ZK node is now part of deleting webhook state when we delete a webhook...
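
To illustrate the idea behind commit 2, here is a minimal sketch, not the actual hub code. It uses an in-memory stand-in for ZooKeeper, and every name in it (`FakeZk`, `delete_webhook`, the `/WebhookState` path, the `locks` and `leases` child names) is a hypothetical chosen for illustration. The point is only that the leader node is removed at the moment the webhook itself is deleted, rather than by a periodic scan that has to guess which nodes are unused.

```python
class FakeZk:
    """In-memory stand-in for a ZooKeeper tree (hypothetical, for illustration)."""

    def __init__(self):
        self.paths = set()

    def create(self, path):
        self.paths.add(path)

    def exists(self, path):
        return path in self.paths

    def delete_recursive(self, path):
        # Remove the node and everything beneath it; a missing node is a no-op.
        prefix = path + "/"
        self.paths = {
            p for p in self.paths
            if p != path and not p.startswith(prefix)
        }


# Assumed layout; the real paths in hub may differ.
LEADER_ROOT = "/WebhookLeader"
STATE_ROOT = "/WebhookState"


def delete_webhook(zk, name):
    """Delete a webhook's state and, in the same step, its leader node."""
    zk.delete_recursive(f"{STATE_ROOT}/{name}")
    zk.delete_recursive(f"{LEADER_ROOT}/{name}")


zk = FakeZk()
for name in ("alpha", "alphabet"):
    zk.create(f"{STATE_ROOT}/{name}")
    zk.create(f"{LEADER_ROOT}/{name}")
    zk.create(f"{LEADER_ROOT}/{name}/locks")
    zk.create(f"{LEADER_ROOT}/{name}/leases")

delete_webhook(zk, "alpha")
```

Because the deletion is tied to an explicit "delete webhook" request, an active webhook's leader node is never touched, whatever state its locks and leases happen to be in at that moment.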

I think this fixes the main cause of the frequent duplicate webhook items that were introduced earlier. The work I have in another branch around curator locks and putting a lock around the WebhookLeader node will (probably) sew up the other duplicate issues that had been seen previously.

chriskessel commented 5 years ago

:heart: