Closed austin-zhu closed 4 years ago
Nerve could provide the flexibility to register as other different types,
curious about why you might want different types of ZK nodes.
@igor47
Nerve could provide the flexibility to register as other different types,
curious about why you might want different types of ZK nodes.
Using a persistent node can reduce the write storm caused by a high number of concurrent session expirations (during ZK degradation, for example). Currently, with ephemeral nodes, as soon as a session expires, the ephemeral nodes will be removed. Nerve "detects" this by ping failures, and then automatically restarts itself. This causes all those nodes to be re-created, which is a high number of concurrent write transactions and this can worsen problems on ZK.
Using persistent nodes can reduce that write storm because the nodes do not need to be recreated. Also, we are looking to use this for dual-writing persistent nodes to a secondary cluster
with ephemeral nodes, as soon as a session expires, the ephemeral nodes will be removed.
curious, if nerve
fails or the instance where nerve
is running looses connectivity to the ZK cluster, what happens to the persistent
node? it should be cleaned up... are you planning to have a separate cleaning task to remove stale nodes? are you changing the semantics of registration in ZK from "this node is currently online and able to handle traffic" to "this node existed at some point, but there are no claims to it's current status" ?
with ephemeral nodes, as soon as a session expires, the ephemeral nodes will be removed.
curious, if
nerve
fails or the instance wherenerve
is running looses connectivity to the ZK cluster, what happens to thepersistent
node? it should be cleaned up... are you planning to have a separate cleaning task to remove stale nodes? are you changing the semantics of registration in ZK from "this node is currently online and able to handle traffic" to "this node existed at some point, but there are no claims to it's current status" ?
Correct, the current idea is to have a cleaning task when using persistent
nodes. The semantics are somewhat changing, with the key idea being that service discovery can tolerate a superset of healthy nodes for a limited amount of time. The data plane (HAProxy/Envoy) can filter out unhealthy nodes.
with ephemeral nodes, as soon as a session expires, the ephemeral nodes will be removed.
curious, if
nerve
fails or the instance wherenerve
is running looses connectivity to the ZK cluster, what happens to thepersistent
node? it should be cleaned up... are you planning to have a separate cleaning task to remove stale nodes? are you changing the semantics of registration in ZK from "this node is currently online and able to handle traffic" to "this node existed at some point, but there are no claims to it's current status" ?
@igor47 it is a very good callout. TBH, we plan to verify the graceful shutdown behavior to make sure whether nerve will remove the node proactively when terminating properly. It seems it does but need to confirm again, check here.
I have to admit that one of the biggest advantages of ephemeral node is to avoid stale data, even if client doesn't clean it up properly. However, as @panchr mentioned earlier, in our case, it will be used for a secondary backup so that the side effect of stale data is acceptable, as we try to avoid primary and secondary ZK cluster failing in the same pattern. The client side proxy should be able to filter out stale data anyway if doing active health check.
Meanwhile, it seems that latest ZK server 3.5.6
supports TTL based node, which could be the eventual solution for this issue. However, the ruby ZK client doesn't support it yet and updating ZK server is risky too. Therefore, the stale data introduced by persistent node is somehow inevitable. Using another cron job to periodically clean it up could mitigate the issue but not ideal.
Overall, I think such a flexibility is still useful and the default behavior should still be ephemeral_sequential
node type.
Summary
Currently, nerve only registers to ZK as the hard-coded type of
:ephemeral_sequential
. Nerve could provide the flexibility to register as other different types, for example,:persistent
type.To keep back compatibility, the type is still
:ephemeral_sequential
if not specified.NOTE: when using
persistent-*
type, it might cause stale data in ZK if the graceful shutdown doesn't invoke or failed executing.Reviewers
@anson627 @panchr
Test
Unit tests
Create a service with different node types to register, they are
N/A
,'ephemeral'
,'persistent'
,'ephemeral_sequential'
andpersistent_sequential
, respectivelyVerify that nodes were created to ZK successfully
Verify that the node type is desired
Cut the connection to ZK by modifying iptable. Verify that the
ephemeral-*
nodes are removed after session timeout (5 secs), but notpersistent-*
node.Verify when gracefully shutdown, all types of nodes will be removed.