
ERL-885: Calls to global:register_name hang in whole cluster for hours #3923

Closed OTP-Maintainer closed 3 years ago

OTP-Maintainer commented 5 years ago

Original reporter: rumataestor
Affected version: OTP-19.3.6
Component: kernel
Migrated from: https://bugs.erlang.org/browse/ERL-885


We are developing a distributed platform that consists of a number of logically and physically separated applications.
Some of those applications use {{global:register_name}} to ensure that only a single process associated with a particular value exists in a logically related part of the cluster.
We also use {{net_kernel:hidden_connect_node}} to prevent the global registry of one logical application from leaking to the rest of the cluster - both to reduce synchronisation overhead and to encourage process discovery through an RPC-based API.
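
For reference, a minimal sketch of that pattern (the node name, the spawned process and the registered name are made up for illustration):

{noformat}
%% Sketch only: node name and registered name are hypothetical.

%% Connect to a node of the same logical application without making the
%% connection visible to the rest of the cluster:
true = net_kernel:hidden_connect_node('other_app_node@host1'),

%% Ensure only a single process exists for a particular value:
Pid = spawn(fun() -> receive stop -> ok end end),
case global:register_name({worker, "some-value"}, Pid) of
    yes -> Pid;                                   %% we own the name now
    no  -> Pid ! stop,                            %% someone registered it first
           global:whereis_name({worker, "some-value"})
end.
{noformat}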

This worked very well for quite some time, until a few days ago we experienced a serious outage in which no node in the cluster could register a new process using {{global:register_name}}.

The first thing we noticed was that the supposedly isolated logical application had actually connected to the rest of the cluster due to a small bug: a connection to a node of a different application was made as a visible (non-hidden) connection.
I don't think that by itself should be a problem, and it couldn't have caused the outage on its own, as the names don't clash.

So I tried to find out why registration was not progressing, by checking the {{registrar}} process attached to the {{global_name_server}}:

{noformat}

iex(pcp_agent@HCCPCPNODE326)22> pid(0,1935,0) |> Process.info(:backtrace) |> elem(1) |> IO.puts
Program counter: 0x00007fac8caf4d50 (global:random_sleep/1 + 392)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007fac8a8cea70 Return addr 0x00007fac8cae9260 (global:set_lock_known/2 + 488)
y(0)     6420

0x00007fac8a8cea80 Return addr 0x00007fac8cae8f68 (global:trans_all_known/1 + 144)
y(0)     4046
y(1)     {global,<0.1935.0>}

0x00007fac8a8cea98 Return addr 0x00007fac8caf5c68 (global:loop_the_registrar/0 + 160)
y(0)     []
y(1)     []
y(2)     {global,<0.1935.0>}
y(3)     #Fun<global.1.112826559>

0x00007fac8a8ceac0 Return addr 0x00000000009383e8 (<terminate process normally>)
y(0)     {<0.15261.568>,#Ref<0.0.15990785.211922>}

{noformat}
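
For completeness, this kind of dump can also be taken without hard-coding pids, assuming (as here) that the registrar is one of the processes linked to {{global_name_server}}:

{noformat}
%% Sketch: dump the backtraces of every process linked to global_name_server
%% and look for the one sitting in global:random_sleep/1.
{links, Links} = process_info(whereis(global_name_server), links),
[begin
     {backtrace, B} = process_info(P, backtrace),
     io:format("~p:~n~s~n", [P, B])
 end || P <- Links, is_pid(P)].
{noformat}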

As far as I can see, the number {{4046}} corresponds to {{Times}} in https://github.com/erlang/otp/blob/master/lib/kernel/src/global.erl#L1007, i.e. the number of times the lock acquisition has been attempted. {{random_sleep}} makes every attempt after the 5th wait a random 0..8 seconds, ~4 seconds on average, which gives 4046 * 4 / 3600 ≈ 4.5 hours, and that matches exactly when the outage started.
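
As a quick sanity check of that estimate:

{noformat}
%% ~4 s average sleep per retry (uniform 0..8 s) after the first few attempts:
1> 4046 * 4 / 3600.
4.495555555555556
{noformat}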

The {{global_name_server}} itself looked like this:

{noformat}

iex(pcp_agent@HCCPCPNODE326)10> pid(0,1933,0) |> Process.info()
[
  registered_name: :global_name_server,
  current_function: {:gen_server, :loop, 6},
  initial_call: {:proc_lib, :init_p, 5},
  status: :waiting,
  message_queue_len: 0,
  messages: [],
  links: [#PID<0.1934.0>, #PID<0.1935.0>, #PID<0.1929.0>],
  dictionary: [
    {{:prot_vsn, :pcp_agent@HCCPCPNODE325}, 5},
    {{:prot_vsn, :pcp_agent@HCCPCPNODE032}, 5},
    {{:sync_tag_my, :pcp_agent@HCCPCPNODE032}, -576460752303423462},
    {{:sync_tag_my, :pcp_agent@HCCPCPNODE325}, -576460752303423461},
    {{:sync_tag_his, :pcp_agent@HCCPCPNODE032}, -576460752303423482},
    {:"$initial_call", {:global, :init, 1}},
    {{:sync_tag_his, :pcp_agent@HCCPCPNODE325}, -576460752303423488},
    {:"$ancestors", [:kernel_sup, #PID<0.1928.0>]}
  ],
  trap_exit: true,
  error_handler: :error_handler,
  priority: :normal,
  group_leader: #PID<0.1927.0>,
  total_heap_size: 32875,
  heap_size: 4185,
  stack_size: 9,
  reductions: 8450304,
  garbage_collection: [
    max_heap_size: %{error_logger: true, kill: true, size: 0},
    min_bin_vheap_size: 46422,
    min_heap_size: 233,
    fullsweep_after: 65535,
    minor_gcs: 7390
  ],
  suspending: []
]

iex(pcp_agent@HCCPCPNODE326)8> pid(0,1933,0) |> Process.info(:backtrace) |> elem(1) |> IO.puts
Program counter: 0x00007fad1b6562b0 (gen_server:loop/6 + 232)
CP: 0x0000000000000000 (invalid)
arity = 0

0x00007fac6dcadec8 Return addr 0x00007fad1b42cc48 (proc_lib:init_p_do_apply/3 + 56)
y(0)     []
y(1)     infinity
y(2)     global
y(3)     {state,true,['cirrus@HCCCIRRUS320','pcp_sms@HCCCIRRUS320','cirrus@HCCCIRRUS321','sipmiddleware@HCCSIP32007','sipmiddleware@HCCSIP32008','sipmiddleware@HCCSIP32009','sipmiddleware@HCCSIP32010','sipmiddleware@HCCSIP32001','sipmiddleware@HCCSIP32002','sipmiddleware@HCCSIP32003','sipmiddleware@HCCSIP32004','sipmiddleware@HCCSIP32005','sipmiddleware@HCCSIP32006','cirrus@HCCCIRRUST32','sipmiddleware_mgr@HCCSIP32001','sipmiddleware_mgr@HCCSIP32002','sipmiddleware_mgr@HCCSIP32003','drt@HCCDRT324','pcp_agent@HCCPCPNODE034','pcp_agent@HCCPCPNODE033','pcp-acd@HCCCIRRUSTEST31','cirrus@HCCCIRRUSTEST31','pcp_agent@HCCPCPNODE031','sipmiddleware_mgr@HCCSIP32007','sipmiddleware_mgr@HCCSIP32008','sipmiddleware_mgr@HCCSIP32009','sipmiddleware_mgr@HCCSIP03001','sipmiddleware_mgr@HCCSIP03003','sipmiddleware_mgr@HCCSIP03002','sipmiddleware@HCCSIP03003','cirrus@HCCCIRRUS301','sipmiddleware@HCCSIP03002','sipmiddleware@HCCSIP03001','sipmiddleware@HCCSIP03004'],
['sipmiddleware@HCCSIP03004','sipmiddleware@HCCSIP03001','sipmiddleware_mgr@HCCSIP03001','sipmiddleware@HCCSIP03003','sipmiddleware_mgr@HCCSIP03003','sipmiddleware@HCCSIP03002','sipmiddleware_mgr@HCCSIP03002','cirrus@HCCCIRRUS301','sipmiddleware_mgr@HCCSIP32009','sipmiddleware_mgr@HCCSIP32008','sipmiddleware_mgr@HCCSIP32007','pcp_agent@HCCPCPNODE031','cirrus@HCCCIRRUSTEST31','pcp-acd@HCCCIRRUSTEST31','sipmiddleware@HCCSIP32003','sipmiddleware@HCCSIP32010','pcp_agent@HCCPCPNODE033','sipmiddleware_mgr@HCCSIP32003','sipmiddleware@HCCSIP32006','sipmiddleware@HCCSIP32008','sipmiddleware_mgr@HCCSIP32002','sipmiddleware@HCCSIP32001','sipmiddleware@HCCSIP32009','pcp_sms@HCCCIRRUS320','sipmiddleware@HCCSIP32002','sipmiddleware@HCCSIP32005','sipmiddleware@HCCSIP32004','sipmiddleware@HCCSIP32007','sipmiddleware_mgr@HCCSIP32001','cirrus@HCCCIRRUST32','cirrus@HCCCIRRUS321','drt@HCCDRT324','pcp_agent@HCCPCPNODE034','cirrus@HCCCIRRUS320'],
[{'pcp_agent@HCCPCPNODE325',-576460752303423461,<0.5224.569>},{'pcp_agent@HCCPCPNODE032',-576460752303423462,<0.14686.568>}],[],'nonode@nohost',<0.1934.0>,<0.1935.0>,no_trace,false}
y(4)     global_name_server
y(5)     <0.1929.0>

0x00007fac6dcadf00 Return addr 0x00000000009383e8 (<terminate process normally>)
y(0)     Catch 0x00007fad1b42cc68 (proc_lib:init_p_do_apply/3 + 88)

{noformat}
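
A snapshot like the one above can also be captured directly, e.g. with the standard {{sys:get_state/1}} call (the exact shape of the state record differs between OTP releases):

{noformat}
%% Sketch: save global's gen_server state and process dictionary for later analysis.
State = sys:get_state(global_name_server),
{dictionary, Dict} = process_info(whereis(global_name_server), dictionary),
file:write_file("global_dump.txt",
                io_lib:format("state: ~p~ndictionary: ~p~n", [State, Dict])).
{noformat}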

I find it hard to believe that attempts to acquire the global lock from those 34 nodes could clash 4046 times in a row.
I assumed that some node (or more than one) had acquired the global lock and never released it, for whatever reason.
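
In hindsight, one thing worth inspecting in that situation would have been the current lock holders, which {{global}} keeps in an ETS table on every node (a sketch, assuming the table is still named {{global_locks}} in your release):

{noformat}
%% Sketch: list the currently held global locks on this node.
ets:tab2list(global_locks).
{noformat}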

Then we disconnected the node from all the rest of the nodes, and that immediately allowed the registrar to proceed with the registrations.
We attempted the same operation on other nodes of the cluster, and after it was done on one of them the problem disappeared on the rest.
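
What we ran was roughly the following (a reconstruction, not the exact commands used at the time):

{noformat}
%% Sketch: drop every connection from this node; this is what immediately
%% unblocked the registrar in our case.
[erlang:disconnect_node(N) || N <- nodes(connected)].
{noformat}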

Sadly we did this in a hurry, so we didn't take a snapshot of global's internal state before disconnecting the node from the cluster.

There is a possibility that some node hostname could not be resolved due to DNS settings, or that the resolved address could not be connected to due to firewall rules.
I attempted to reproduce those scenarios manually in my development cluster, but in the cases I tried the global lock never hung and everything worked fine.

I tried to search for other reports of similar issues and found this: http://erlang.org/pipermail/erlang-questions/2014-August/080524.html with references to some even older reports.
I also found https://bugs.erlang.org/browse/OB-5 which mentions Distributed Hash Tables as an alternative to global.

After that I started to investigate the {{global}} implementation to find out what might prevent the global lock from being released. I learned quite a lot about how global works, but wasn't able to identify the exact place leading to the problem we observed.

While reading the code I noticed that it has barely changed since at least R13 and still contains support for R7 nodes.
Considering the OB-5 ticket, I thought it might be possible to modernise it and drop some unnecessary code.

My colleague suggested that name registration could use the name being registered as the lock identifier instead of the single global lock, so that registrations of other names could progress even if one name got stuck for whatever reason.

After considering it and re-reading how the global implementation acquires the global lock, I realised that even a name-specific lock may not be needed for the name registration process: the name could be registered on the nodes in exactly the same way the global lock is acquired, and the strong guarantees provided by the globally locked exchange step are not really necessary during the racy period of node interconnection.
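
Expressed at the API level, the per-name lock idea would look roughly like this (a hypothetical sketch using the public lock API; the real change would have to live inside {{global}} itself, which today always uses the single lock id visible in the backtrace above):

{noformat}
%% Hypothetical sketch of the per-name lock idea: serialise only on the
%% name being registered instead of on one cluster-wide lock.
register_singleton(Name, Pid) ->
    LockId = {{registration, Name}, self()},
    global:trans(LockId,
                 fun() -> global:register_name(Name, Pid) end,
                 [node() | nodes()]).
{noformat}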

Do you know of these or similar issues with {{global}} (maybe in your internal issue tracker)?
If yes, are they acted upon or going to be acted upon soon (months)?

Are you ready to accept patches that significantly change the {{global}} internal implementation?
If yes, how backwards compatible do they need to be?
OTP-Maintainer commented 4 years ago

hasse said:

Hi,

Thank you for the detailed bug report, and sorry for the long delay.

> Are you ready to accept patches significantly changing global internal implementation?

I'm afraid we cannot take the risk which is implied by a significant
 rewrite.

> Do you know of these or similar issues with global (maybe in your internal issue tracker)?
 > If yes, are they acted upon or going to be acted upon soon (months)?

We do see occasional hick-ups in our daily builds, and we act upon
 them as much as time permits. We are currently trying to pinpoint the
 cause of these hick-ups.

There have been a couple of bug fixes of the emulator during the past
 few years (I don't have an explicit list of them; they are mentioned in
 release notes). You're using quite an old release (19). Have you
 upgraded since the bug report? If so, have you seen any improvements?

Best regards,

Hans Bolinder, Erlang/OTP team, Ericsson
OTP-Maintainer commented 4 years ago

rumataestor said:

I just checked the changes in the `global` module and see there were 2 changes which first appeared in OTP-22.0, so it is worth checking.

I understand the risks of a significant rewrite; however, during that old investigation I noticed a number of confusing approaches which make reasoning about how `global` works very difficult, and I think the code would benefit from some refactoring. The new changes don't improve those parts much.

Here are some things I remember from the old investigation and consider problematic:
 * circular dependencies between `global` and `global_groups` modules,
 * `global_name_server` sending messages to the `registrar` process, which in turn calls `global_name_server`, a kind of circular dependency between classes of processes,
 * the global lock preventing any registration while nodes are synchronising, which may take significant time depending on the number of nodes and the amount of data on them.

I'm not sure how often `global` is used in real projects, but the existence of `gproc`, `gen_leader`, `swarm` and `horde` creates the impression that people avoid `global` for a reason. I'm not sure what those reasons are, but I'd like to be able to say "you don't need any of those - `global` is good enough"; right now, though, I'm afraid it doesn't get enough attention and is not polished enough to be used in production.
uabboli commented 3 years ago

I'm not sure, but it's likely that GH-4912 fixes this particular problem. One bug was corrected, and now the test suite hasn't failed in our daily builds for quite some time.

RumataEstor commented 1 year ago

Sorry for raising this issue again but I suppose it might need to be reopened.

I had a new case of a similar problem yesterday in a completely different project, which doesn't use any hidden nodes but does use "auto connect" and "prevent overlapping partitions"... This time the cluster consists of just 13 nodes, although they run in Kubernetes pods and occasionally seem to be rescheduled to different hosts... Unfortunately this time I forgot to follow the steps I described here, and I'm not sure the registrar showed similar values.

I tried to unblock global using the approach found in RabbitMQ (https://github.com/rabbitmq/rabbitmq-server/commit/fba455c61c0b82f291b72bc05cc8199b8dbdae5c), but it didn't help. I then compared the values stored under {sync_tag_his, PeerNode} on each node with {sync_tag_my, ThisNode} on the corresponding peer nodes and found that they were all equal.
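
The comparison I did was roughly along these lines (a sketch; the helper name is made up and rpc failures are not handled):

```erlang
%% For every connected peer, compare what this node recorded as the peer's
%% sync tag with what the peer recorded as its own tag towards this node.
compare_sync_tags() ->
    {dictionary, MyDict} = process_info(whereis(global_name_server), dictionary),
    [begin
         His = proplists:get_value({sync_tag_his, Peer}, MyDict),
         PeerGlobal = rpc:call(Peer, erlang, whereis, [global_name_server]),
         {dictionary, PeerDict} =
             rpc:call(Peer, erlang, process_info, [PeerGlobal, dictionary]),
         Mine = proplists:get_value({sync_tag_my, node()}, PeerDict),
         {Peer, His, Mine, His =:= Mine}
     end || Peer <- nodes()].
```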

Then I noticed that the global_name_server had quite a few {pending, PeerNode} entries. After I used net_kernel:disconnect_node(PeerNode) for all such nodes, global got unlocked and continued to operate normally.
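
In other words, roughly the following (a sketch using the documented `erlang:disconnect_node/1`; the node names are placeholders for whatever showed up as `{pending, PeerNode}`):

```erlang
%% Disconnect every peer this node still considers "pending"; in this case
%% that was enough to unblock global.
PendingPeers = ['node_a@host1', 'node_b@host2'],   % placeholders
[erlang:disconnect_node(N) || N <- PendingPeers].
```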

While trying to repeat this operation on the rest of the nodes, I noticed that "prevent overlapping partitions" kicked in and started disconnecting other nodes... As we use the Kubernetes API to discover the nodes which should be connected and libcluster to connect them all together, the nodes were reconnected immediately, so the "overlapping partitions prevention" resulted in 2366 disconnections between just 10 different nodes (screenshot of the disconnection counts attached).

As I'm sure the nodes could actually connect to each other, I wonder if you think "prevent overlapping partitions" worked correctly in this case?

Anyway, after I disconnected all the peer nodes listed as {pending, PeerNode} in global_name_server on each of the nodes, the cluster resumed working correctly.

Does this help in diagnosing what is actually misbehaving? I'm planning to add some diagnostic code to run before the fix is applied automatically, so please let me know which parts of the state(s) I should dump for further research.
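
For reference, the kind of snapshot I have in mind before applying the workaround automatically would be something like this (a sketch; the exact set of fields to capture is what I'm asking about):

```erlang
%% Sketch of a diagnostic snapshot to take before disconnecting anything:
%% global's gen_server state, its process dictionary and the node lists.
dump_global_state() ->
    #{state      => sys:get_state(global_name_server),
      dictionary => process_info(whereis(global_name_server), dictionary),
      names      => global:registered_names(),
      visible    => nodes(),
      hidden     => nodes(hidden)}.
```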