Open jacob-netguardians opened 1 week ago
PR could be created (sorry, I do not have that sort of right, still a newbie here) using the following branch: https://github.com/jacob-netguardians/helix/tree/bugfix/2965-atomic-znode-creation-on-node-registration
Describe the bug
When two nodes are registered very closely to eachother, it can happen that the ZKHelixAdmin.addInstance() method is called at about the same time on the two nodes. The one that comes slightly before the other one will start creating ZNodes for the "INSTANCE" hierarchy in Zookeeper.
Let's say the second node starts the run of the ZKHelixAdmin.addInstance() method now. Its call to findInstancesWithMatchingLogicalId() will fail badly, throwing an exception (see below), because the first node's INSTANCE hierarchy is not complete:
So... while adding node-2, it complains about node-1 being invalid, and consequently sets the "instance-operation" on node-2 to UNKNOWN instead of ENABLE.
Version of helix (1.4.1.1) above is due to the fact that I had already fixed #2963 and had a local version number change for this. Line numbers in the logs above may also not exactly match what is known in version 1.4.1, due to the addition of debug logging I had to do before actually understanding the problem.
Expected behavior
A node should be registered completely or not at all, not having that kind of impact on the fellow nodes' registration.
Additional context
I propose the following fix, in the ZKHelixAdmin class, creating all those nodes in the node's INSTANCE hierarchy in an atomic fashion:
Replace, in the ZKHelixAdmin class, addInstance() method, this:
with
where
createPersistentNodeOp
method is defined as: