I have been considering using distributed application mechanism for a project, however it seems very unlikely we use this feature. I bet we are not alone, since it's got a major flaw preventing the adoption on cloud-based set ups: it does not recover at all from network splits.
This is, in fact, well known behavior. See:
http://erlang.org/pipermail/erlang-questions/2014-November/081791.html
and
http://learnyousomeerlang.com/distributed-otp-applications states
"If you deem netsplits more likely than hardware failures, then you have to be aware of the possibility that the application is running both as a backup and main one, and that funny things could happen when the network issue is resolved. Maybe distributed OTP applications aren't the right mechanism for you in these cases."
I have ran some tests, and it is precisely what happens indeed. The takeover does not take effect - contrary to situation when one node goes down and then is started again.
While netsplit may be rare when you run cluster on co-located servers, sitting in the same data center, it is more likely when you have a cloud deployment. Considering how popular cloud deployments are these days, this renders Erlang's built in mechanism for distributed applications unsuitable for these cloud based scenarios.
I propose there's either an update to existing dist_ac module needed, or a second version of such controller is required to allow people setting up fairly simple clusters of Erlang nodes, with failover and takeover, which recovers from instances of network disconnections.
Original reporter:
hubertlepicki
Affected version:Not Specified
Component:kernel
Migrated from: https://bugs.erlang.org/browse/ERL-515