derekkraan / horde

Horde is a distributed Supervisor and Registry backed by DeltaCrdt
MIT License

Add hook to manually control redistribution and a cast message to manually trigger redistribution. #253

Open veedo opened 2 years ago

veedo commented 2 years ago

The allow_handoff function determines whether a handoff is allowed to proceed when the node wants to eject a process. This gives maximum flexibility to the power users who need it, while the function signatures still feel simple. An added benefit is that the user can manually trigger redistribution with a different function if the situation calls for it. It's implemented as an option list in the handoff so that it can be extended further in the future.

I finally got around to working on this as we discussed in https://github.com/derekkraan/horde/pull/198 and https://github.com/derekkraan/horde/issues/197.

The customization seems to work really well in my project. I can look at the source/destination nodes and the child spec and make more complex decisions about whether the redistribution should proceed.

veedo commented 2 years ago

Tests pass when running locally. Seems like a timing issue in the test that failed on that build.

veedo commented 1 year ago

@derekkraan Any issues/feedback? I can try a different style if it smells bad :stuck_out_tongue:

derekkraan commented 1 year ago

Hi @veedo,

Thanks for the PR. Sorry it took me so long to get to it.

I'm not so sure about the redistribute_children/2 function. Can you give me an example of when you would use this in your code?

veedo commented 1 year ago

The simplest example is just to delay redistribution to a quiet time on the network/device.

Task.start(fn ->
  Process.sleep(60_000)
  for sup <- supervisors, do: DynamicSupervisor.redistribute_children(sup)
end)

This still allows new processes to be added when the node starts up, and those will get load balanced. Because those processes are already balanced, less rebalancing has to happen later.

A less trivial example is closer to our application. I am using, or plan to use, the allow_handoff function for a few purposes:

Example of keeping some processes together (the node affinity is baked into the child_spec):

def keep_process_on_best_nodes({current_node, chosen_node, _, _}) when current_node == chosen_node do
  true
end
def keep_process_on_best_nodes({_current_node, chosen_node, child_spec, _child_pid}) do
  best_nodes = Locality.get_nodes(child_spec)
  best_nodes_all_dead = Enum.all?(best_nodes, &(&1 not in [node() | Node.list()]))
  (chosen_node in best_nodes) or best_nodes_all_dead
end

...

DynamicSupervisor.redistribute_children(sup, &keep_process_on_best_nodes/1)
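For anyone who wants to try the decision logic without a running cluster, here is a minimal sketch of the same predicate. The module name HandoffSketch, the preferred_nodes/1 helper, and the :preferred_nodes key in the child spec are all hypothetical stand-ins for the application's Locality module, and the list of live nodes is passed in as an argument instead of being read from node() and Node.list(). It only illustrates the tuple shape shown above; it is not the PR's implementation.

```elixir
defmodule HandoffSketch do
  # Hypothetical stand-in for Locality.get_nodes/1: reads the preferred nodes
  # from a :preferred_nodes key that is assumed to be stored in the child spec.
  def preferred_nodes(child_spec), do: Map.get(child_spec, :preferred_nodes, [])

  # A process that would stay on its current node needs no decision.
  def allow_handoff?({current_node, chosen_node, _child_spec, _child_pid}, _live_nodes)
      when current_node == chosen_node,
      do: true

  # Otherwise allow the move only if the chosen node is a preferred node,
  # or if none of the preferred nodes is currently alive.
  def allow_handoff?({_current_node, chosen_node, child_spec, _child_pid}, live_nodes) do
    best_nodes = preferred_nodes(child_spec)
    best_nodes_all_dead = Enum.all?(best_nodes, &(&1 not in live_nodes))
    chosen_node in best_nodes or best_nodes_all_dead
  end
end
```

In a real cluster the second argument would be [node() | Node.list()]. Note that a child spec with no preferred nodes always allows the handoff, because Enum.all?/2 returns true for an empty list.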