This fixes a bug where, under certain restart and race-condition circumstances, all state for a webhook in ZooKeeper was destroyed when that webhook was stopped on a node. This caused some batch channel data loss during planned hub downtime, and possibly at other times as well. It's hard to know.
The fix ensures that when a node stops a webhook, the only webhook-related data changed in ZooKeeper is the record that the node is running the webhook. All other state should remain (that is, the last-completed info, which items are currently being delivered, and which items have errored out).
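To make the intended behavior concrete, here is a minimal Python sketch of what "stop" is allowed to touch. It is only an illustration, not the hub's actual code: `WebhookState`, its field names, and `stop_on_node` are hypothetical stand-ins for the per-webhook data kept in ZooKeeper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class WebhookState:
    """Hypothetical in-memory stand-in for one webhook's state in ZooKeeper."""
    running_on: Set[str] = field(default_factory=set)   # nodes running the webhook
    last_completed: Optional[str] = None                # last completed item
    in_flight: Set[str] = field(default_factory=set)    # items currently being delivered
    errors: List[str] = field(default_factory=list)     # items that have errored out


def stop_on_node(state: WebhookState, node: str) -> None:
    """Stop the webhook on one node.

    The only change is that `node` no longer counts as running the webhook.
    Last-completed, in-flight, and error information are left untouched.
    """
    state.running_on.discard(node)
```

With this shape, stopping a webhook on a node that is shutting down cannot wipe the cursor or error history, which is the data the bug was destroying.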
While researching this bug, I realized that we mix up and misuse some verbs related to webhook actions, in particular "delete," "stop," and "remove." I renamed a bunch of things to make the distinction clear. "Stop" now refers to stopping a webhook from running on a node. "Delete" refers to both deleting the webhook's state info in ZooKeeper and deleting the webhook configuration from dynamo, which I think always go hand-in-hand.
Behavior around stopping and deleting should look something like this:
- Turning off a node should STOP a webhook and leave all other state as-is.
- The following should STOP the webhook from running, DELETE the webhook state in ZK, and then DELETE the webhook configuration in dynamo:
  - Deleting a webhook.
  - Updating a webhook cursor (which should then recreate and restart the webhook with the new latest item).
  - Changing a BOTH channel to SINGLE, which deletes the internal batch webhook we use to zip items and ship them to S3.
  - Changing a channel that uses replication to not use replication, which deletes the internal replication webhook.
  - Deleting a BATCH or BOTH channel, or a replicated channel, which deletes the related internal webhook.
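The difference between the two verbs can be sketched in a few lines of Python. This is a rough model under assumptions, not the real implementation: `WebhookStore`, its three dictionaries, and the method names are hypothetical, with plain dicts standing in for ZooKeeper and dynamo.

```python
from typing import Dict, Set


class WebhookStore:
    """Hypothetical model: dicts stand in for ZooKeeper and dynamo."""

    def __init__(self) -> None:
        self.running: Dict[str, Set[str]] = {}     # webhook name -> nodes running it (ZK)
        self.zk_state: Dict[str, dict] = {}        # webhook name -> delivery state (ZK)
        self.dynamo_config: Dict[str, dict] = {}   # webhook name -> configuration (dynamo)

    def create(self, name: str, config: dict, node: str) -> None:
        self.dynamo_config[name] = config
        self.zk_state.setdefault(
            name, {"last_completed": None, "in_flight": [], "errors": []}
        )
        self.running.setdefault(name, set()).add(node)

    def stop(self, name: str, node: str) -> None:
        # STOP: the node no longer runs the webhook; nothing else changes.
        self.running.get(name, set()).discard(node)

    def delete(self, name: str) -> None:
        # DELETE: stop the webhook everywhere, then drop its ZK state,
        # then drop its configuration.
        self.running.pop(name, None)
        self.zk_state.pop(name, None)
        self.dynamo_config.pop(name, None)


store = WebhookStore()
store.create("batch-hook", {"channel": "example"}, node="node-1")
store.zk_state["batch-hook"]["last_completed"] = "item-41"

store.stop("batch-hook", "node-1")   # e.g. the node is being turned off
survives_stop = store.zk_state["batch-hook"]["last_completed"]

store.delete("batch-hook")           # e.g. the webhook itself is deleted
```

In this model, turning off a node maps to `stop`, while every case in the list above (deleting a webhook, updating a cursor, dropping batch or replication from a channel, deleting the channel) maps to `delete`, optionally followed by a fresh `create`.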
That's a lot of words. Hopefully they make sense to you and future me. Please reach out if you think talking IRL about the changes would be helpful!