Closed by OTP-Maintainer 3 years ago
hasse said:
Hi,
Thank you for the detailed bug report, and sorry for the long delay.
> Are you ready to accept patches significantly changing global internal implementation?
I'm afraid we cannot take the risk implied by a significant rewrite.
> Do you know of these or similar issues with global (maybe in your internal issue tracker)?
> If yes, are they acted upon or going to be acted upon soon (months)?
We do see occasional hiccups in our daily builds, and we act upon them as much as time permits. We are currently trying to pinpoint the cause of these hiccups.
There have been a couple of bug fixes of the emulator during the past few years (I don't have an explicit list of them; they are mentioned in the release notes). You're using quite an old release (19). Have you upgraded since the bug report? If so, have you seen any improvements?
Best regards,
Hans Bolinder, Erlang/OTP team, Ericsson
rumataestor said:
I just checked the changes in the `global` module and see there were two changes which first appeared in OTP-22.0, so it is worth checking.
I understand the risks of a significant rewrite; however, after that old investigation I noticed a number of confusing approaches which make reasoning about how `global` works very difficult, and I think the code would benefit from some refactoring. The new changes don't improve those parts much.
Here are some things I remember from the old investigation and consider problematic:
* circular dependencies between `global` and `global_groups` modules,
* `global_name_server` sending messages to the `registrar` process, which in turn calls `global_name_server` - a kind of circular dependency between classes of processes,
* the global lock preventing any registration while nodes are synchronising, which may take significant time depending on the number of nodes and the amount of data on them (see the sketch after this list).
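To illustrate that last point, here is a minimal, hypothetical two-node sketch (not from the original report): the resource id `global` is the one `global` itself locks while registering names and synchronising nodes, so while that lock is held, `global:register_name/2` blocks cluster-wide.

```erlang
%% In one shell process: grab global's own resource id and hold it for a
%% while, roughly simulating a long node synchronisation.
Locker = spawn(fun() ->
                   true = global:set_lock({global, self()}),  %% global's internal resource id
                   timer:sleep(30000),                        %% hold the lock for 30 s
                   global:del_lock({global, self()})
               end).

%% In another shell (on any connected node): this call does not return
%% until the lock above is released.
yes = global:register_name(demo_name, self()).
```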
I'm not sure how often `global` is used in real projects, but the existence of `gproc`, `gen_leader`, `swarm` and `horde` creates the impression that people avoid `global` for some reason. I don't know what those reasons are, but I'd like to be able to say "you don't need any of those - `global` is good enough"; right now, however, I'm afraid it doesn't get enough attention and is not polished enough to be used in production.
I'm not sure, but it's likely that GH-4912 fixes this particular problem. One bug was corrected, and now the test suite hasn't failed in our daily builds for quite some time.
Sorry for raising this issue again but I suppose it might need to be reopened.
I had a new case of a similar problem yesterday in a completely different project which doesn't use any hidden nodes but does use "auto connect" and "prevent overlapping partitions"... This time the cluster consists of just 13 nodes, although they run in Kubernetes pods and occasionally seem to be rescheduled to different hosts... Unfortunately, this time I forgot to follow what I described here, and I'm not sure the `registrar` showed similar values.
I tried to unblock global using the approach found in RabbitMQ (https://github.com/rabbitmq/rabbitmq-server/commit/fba455c61c0b82f291b72bc05cc8199b8dbdae5c), but it didn't help. I tried to compare the versions attached to `{sync_tag_his, PeerNode}` on each node with `{sync_tag_my, ThisNode}` on the peer nodes and found that they were all equal on all of them.
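For reference, a hedged sketch of how such a comparison can be scripted. It assumes the sync tags live in the process dictionary of `global_name_server` under the keys `{sync_tag_my, Node}` and `{sync_tag_his, Node}`; that is internal, undocumented state and may change between OTP releases.

```erlang
-module(global_sync_check).
-export([compare_sync_tags/1]).

%% Process dictionary of global_name_server on Node (an rpc:call/4 to the
%% local node simply runs locally, so this works for node() as well).
global_dict(Node) ->
    Pid = rpc:call(Node, erlang, whereis, [global_name_server]),
    {dictionary, Dict} = rpc:call(Node, erlang, process_info, [Pid, dictionary]),
    Dict.

%% Compare the tag this node recorded for PeerNode ("his" tag) with the
%% tag PeerNode generated for this node ("my" tag). In the case described
%% above these turned out to be equal on all nodes.
compare_sync_tags(PeerNode) ->
    HisTagHere  = proplists:get_value({sync_tag_his, PeerNode}, global_dict(node())),
    MyTagOnPeer = proplists:get_value({sync_tag_my, node()}, global_dict(PeerNode)),
    {HisTagHere, MyTagOnPeer, HisTagHere =:= MyTagOnPeer}.
```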
Then I noticed that the `global_name_server` had quite a few `{pending, PeerNode}` entries. After I used `net_kernel:disconnect_node(PeerNode)` for all such nodes, global got unlocked and continued to operate normally.
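A hedged sketch of that manual workaround, assuming the `{pending, PeerNode}` entries are visible in the process dictionary of `global_name_server` (again internal, undocumented state):

```erlang
%% In a remote shell on the affected node:
{dictionary, Dict} = process_info(whereis(global_name_server), dictionary).

%% Peers that still have a {pending, Node} entry.
Pending = [Node || {{pending, Node}, _} <- Dict].

%% Disconnect them; with automatic (re)connection in place, global should
%% then restart synchronisation with each peer.
[net_kernel:disconnect_node(Node) || Node <- Pending].
```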
While trying to repeat this operation on the rest of the nodes, I noticed that "prevent overlapping partitions" kicked in and started disconnecting other nodes... As we use the Kubernetes API to discover the nodes which should be connected and libcluster to connect them all together, the nodes were reconnected, so the "overlapping partitions prevention" resulted in 2366 disconnections between just 10 different nodes:
As I'm sure the nodes could actually connect to each other, I wonder if you think "prevent overlapping partitions" worked correctly in this case?
Anyway, after I disconnected all the peer nodes found as `{pending, PeerNode}` in `global_name_server` on each of the nodes, the cluster resumed its work correctly.
Does this help in diagnosing what is actually misbehaving? I'm planning to add some diagnostic code that runs before the fix is applied automatically, so please let me know what parts of the state(s) I should dump for further research.
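In the meantime, here is a hedged sketch of the kind of snapshot such diagnostic code could take; it uses only documented calls plus `sys:get_state/1` and the process dictionary of `global_name_server`, and could be extended with whatever internal fields you consider useful:

```erlang
%% Shell fun taking a snapshot of global-related state on the local node.
%% Usage: DumpGlobal().
DumpGlobal = fun() ->
    Pid = whereis(global_name_server),
    {dictionary, Dict} = process_info(Pid, dictionary),
    #{node         => node(),
      connected    => nodes(),
      global_names => global:registered_names(),
      global_state => sys:get_state(Pid),   %% the internal state record
      global_dict  => Dict}                 %% sync tags, pending entries, ...
end.
```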
Original reporter: rumataestor
Affected version: OTP-19.3.6
Component: kernel
Migrated from: https://bugs.erlang.org/browse/ERL-885