bitwalker / libcluster

Automatic cluster formation/healing for Elixir applications
MIT License

A proposal of enhancement to the epmd strategy #168

Closed qhwa closed 1 year ago

qhwa commented 3 years ago

Discussed in https://github.com/bitwalker/libcluster/discussions/167

Originally posted by **qhwa** September 23, 2021

Hi, first of all, thanks for the awesome library! It's a fundamental part of our Elixir application in Kubernetes.

Right now libcluster has [the epmd strategy](https://github.com/bitwalker/libcluster/blob/master/lib/strategy/epmd.ex), which works just fine. However, in our scenario we want two more features on top of it:

1. The application should start successfully only once the remote node is connected, i.e. once the current node has joined the cluster, similar to how [`sync_nodes_mandatory`](https://erlang.org/doc/design_principles/distributed_applications.html) works in Erlang's distributed applications.
2. After the application starts, if the remote node disconnects, we want `libcluster` to keep retrying the connection in the background, so the nodes reconnect when the remote comes back online.

I created a strategy based on the epmd strategy to achieve that. The strategy is a single module:

```elixir
defmodule ClusterStrategy.ContactNodeStrategy do
  use Cluster.Strategy

  alias Cluster.Strategy.State

  require Logger

  @default_retry_delay 5_000
  @default_max_retries 12_000

  @impl true
  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def init([%State{config: config} = state]) do
    case config[:hosts] do
      [_ | _] = nodes ->
        for node <- nodes, do: connect!(node, state)
        {:ok, state}

      [] ->
        :ignore

      nil ->
        :ignore
    end
  end

  defp connect!(node, state) when is_atom(node) do
    :ok = connect(node, state)
    true = Node.monitor(node, true)
    node
  end

  defp connect(node, state) when is_atom(node),
    do:
      Cluster.Strategy.connect_nodes(
        state.topology,
        state.connect,
        state.list_nodes,
        [node]
      )

  def handle_info({:nodedown, node}, state) do
    Logger.error("Node down: #{node}")
    spawn_link(fn -> retry_connect(node, state, max_retries(state)) end)
    {:noreply, state}
  end

  defp retry_connect(node, state, retries_left)
       when is_integer(retries_left) and retries_left >= 0 do
    case connect(node, state) do
      :ok ->
        :ok

      {:error, _} ->
        :timer.sleep(retry_delay(state))
        retry_connect(node, state, retries_left - 1)
    end
  end

  defp retry_connect(node, state, _),
    do:
      raise(
        "Can't connect to the contact node #{node} after #{max_retries(state)} retries, terminating."
      )

  defp max_retries(%{config: config}),
    do: Keyword.get(config, :max_retries, @default_max_retries)

  defp retry_delay(%{config: config}),
    do: Keyword.get(config, :retry_delay, @default_retry_delay)
end
```

I'd like to hear from you whether this is a good idea. If so, is there a way we can contribute it back to the library?
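For context, a topology entry using the proposed strategy might look like the following. This is a hypothetical sketch: the `:hosts`, `:retry_delay`, and `:max_retries` options come from the module above, while the topology name and node names are placeholders.

```elixir
# config/runtime.exs (illustrative; node names are placeholders)
config :libcluster,
  topologies: [
    contact_node: [
      strategy: ClusterStrategy.ContactNodeStrategy,
      config: [
        hosts: [:"app@node1.internal", :"app@node2.internal"],
        # optional overrides for the retry behaviour
        retry_delay: 5_000,
        max_retries: 12_000
      ]
    ]
  ]
```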
BrendanBall commented 2 years ago

We've just moved from peerage to libcluster and were surprised to find out that the epmd strategy doesn't attempt to reconnect. I think the epmd strategy needs to support failure modes and reconnect.
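For readers unfamiliar with the mechanics, a strategy that supports reconnection generally subscribes to node up/down events and retries on disconnect. A minimal sketch (not libcluster's implementation; the module name, retry delay, and message shapes beyond the standard `{:nodeup, node}`/`{:nodedown, node}` events are illustrative) using Erlang's `:net_kernel.monitor_nodes/1`:

```elixir
defmodule ReconnectSketch do
  use GenServer

  @retry_delay 5_000

  def start_link(hosts), do: GenServer.start_link(__MODULE__, hosts)

  @impl true
  def init(hosts) do
    # Subscribe to {:nodeup, node} / {:nodedown, node} messages.
    :ok = :net_kernel.monitor_nodes(true)
    for node <- hosts, do: Node.connect(node)
    {:ok, MapSet.new(hosts)}
  end

  @impl true
  def handle_info({:nodedown, node}, hosts) do
    # Instead of giving up on the first disconnect, schedule a retry.
    if MapSet.member?(hosts, node) do
      Process.send_after(self(), {:retry, node}, @retry_delay)
    end

    {:noreply, hosts}
  end

  def handle_info({:retry, node}, hosts) do
    # Node.connect/1 returns true on success; keep retrying otherwise.
    unless Node.connect(node) == true do
      Process.send_after(self(), {:retry, node}, @retry_delay)
    end

    {:noreply, hosts}
  end

  def handle_info(_other, hosts), do: {:noreply, hosts}
end
```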

johanerikssonpingpayments commented 2 years ago

I can only agree. We can write our own strategy based on the existing one, but it would be a nice feature to have out of the box.

bitwalker commented 2 years ago

I'm open to improving the existing strategy, but time is always hard to find. If you have an existing implementation that you are using that you think is a strict improvement over the current one, and are willing to contribute that work, I'll certainly make time to get such a PR merged ASAP. I'll review the proposed implementation in the OP this week, so if that covers your use case, you can disregard the above.

qhwa commented 2 years ago

> I'm open to improving the existing strategy, but time is always hard to find. If you have an existing implementation that you are using that you think is a strict improvement over the current one, and are willing to contribute that work, I'll certainly make time to get such a PR merged ASAP. I'll review the proposed implementation in the OP this week, so if that covers your use case, you can disregard the above.

Thanks for the reply @bitwalker I'll try to find a slot to make a PR for it.

bitwalker commented 1 year ago

#183 has been merged, so this will be fixed in the next release (going out today).