bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir

target module not available on remote node #122

Open alexferreira opened 5 years ago

alexferreira commented 5 years ago

I have two services, calculator and investigator, clustered with libcluster.

When I register my service, something strange sometimes happens. For example, when I start the calculator service, it logs the following message:

[warn] [swarm on calculator@127.0.0.1] [tracker: start_pid_remotely] "a475b420-e5f8-4528-9b72-766b7e75d177" could not be started on investigator@127.0.0.1: target module not available on remote node, retrying operation after 1000ms ..

and the investigator service logs the following:

[warn] [swarm on investigator@127.0.0.1] [tracker: do_track] ** (UndefinedFunctionError) function Calculator.Supervisor.register/1 is undefined (module Calculator.Supervisor is not available)
    Calculator.Supervisor.register("a475b420-e5f8-4528-9b72-766b7e75d177")
    (swarm) lib/swarm/tracker/tracker.ex:1082: Swarm.Tracker.do_track/2
    (stdlib) gen_statem.erl:1660: :gen_statem.call_state_function/5
    (stdlib) gen_statem.erl:1023: :gen_statem.loop_event_state_function/6
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3

However, sometimes when I start it, it works without problems.

Can anybody help me?

arjan commented 5 years ago

To me this looks like you have a cluster with heterogeneous OTP apps. For Swarm to work, the OTP application that you are going to distribute processes for (e.g. with Swarm.register_name/4) needs to be available on all the nodes participating in the Swarm cluster.
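One way to check this requirement from an IEx session is to ask every connected node whether it can load the callback module. The following is only a minimal sketch: `ClusterCheck` and `nodes_missing/2` are hypothetical names and are not part of Swarm; it relies only on `Code.ensure_loaded?/1` and `:rpc.call/4`.

```elixir
defmodule ClusterCheck do
  # Hypothetical helper, not part of Swarm.
  # Returns the nodes (the local node plus all connected nodes by default)
  # on which `module` cannot be loaded.
  def nodes_missing(module, nodes \\ [Node.self() | Node.list()]) do
    Enum.reject(nodes, fn node ->
      # A failed RPC returns {:badrpc, reason}; treat that as "missing" too.
      :rpc.call(node, Code, :ensure_loaded?, [module]) == true
    end)
  end
end
```

For the module in the stack trace above, `ClusterCheck.nodes_missing(Calculator.Supervisor)` would list the nodes where it is unavailable at that moment. An empty list means every node could load it.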

alexferreira commented 5 years ago

@arjan This happens right after calling Swarm.register_name/4; it only works after a few attempts.

arjan commented 5 years ago

So are the same OTP applications started on both nodes?

alexferreira commented 5 years ago

Yes, the same applications were started on both nodes.

It's working right now. However, if I stop one of the applications and start it again many times, the problem mentioned above happens.

arjan commented 5 years ago

Do you mean stopping the node or just stopping the application (Application.stop)? Maybe the cluster is already formed before all application code is loaded, and tracker requests are already coming in; however, I cannot imagine that this takes very long...

alexferreira commented 5 years ago

In this first GIF, you can see that I started the applications and the error quoted above appeared shortly afterwards.

[GIF: swarm]

In the second GIF, you can see that the error does not happen.

[GIF: swarm1]

bitwalker commented 5 years ago

The problem seems to be that the second node is still loading code when Swarm on the first node tells Swarm on the second node to start a process (resulting in the crash, because the code isn't loaded yet). This is happening because when running with Mix, applications and their code are loaded and started sequentially, while in a release, all application code is first loaded, then applications are started.

My guess is that Mix starts Swarm before it starts the part of the system which invokes register_name, so Swarm on the second node starts and is able to communicate with the first node and accept registration requests before the code for the registration callback is loaded - since this is inherently racy, that's why it works only some of the time.

@arjan @beardedeagle Until we get the refactoring implemented so that Swarm can be started under the supervision tree rather than as its own tree, we could provide a configuration option which allows specifying an application that needs to be started before Swarm will start serving requests, and then basically just loop until the application status (via :application_controller.info/0) shows that it is started. Thoughts? The refactor is really the fix, but having a short term solution to this would be nice.
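The "loop until the application is started" idea could look roughly like the sketch below. It is an illustration only: `WaitForApps` and `await/3` are hypothetical names, Swarm does not ship such a helper, and it polls `Application.started_applications/0` instead of `:application_controller.info/0` for brevity. It accepts a single application name or a list.

```elixir
defmodule WaitForApps do
  # Hypothetical helper illustrating the proposal; not part of Swarm.
  # Polls until every application in `apps` is started, or gives up
  # once `timeout_ms` has elapsed.
  def await(apps, timeout_ms \\ 5_000, interval_ms \\ 100) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    do_await(List.wrap(apps), deadline, interval_ms)
  end

  defp do_await(apps, deadline, interval_ms) do
    # started_applications/0 returns {app, description, version} tuples.
    started = for {app, _desc, _vsn} <- Application.started_applications(), do: app

    case apps -- started do
      [] ->
        :ok

      pending ->
        if System.monotonic_time(:millisecond) >= deadline do
          {:error, {:not_started, pending}}
        else
          Process.sleep(interval_ms)
          do_await(apps, deadline, interval_ms)
        end
    end
  end
end
```

For example, `WaitForApps.await(:calculator)` would block until an application named `:calculator` (an assumed name) is reported as started, or return `{:error, {:not_started, [:calculator]}}` after the timeout.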

alexferreira commented 5 years ago

@bitwalker I worked around the problem by using a DynamicSupervisor.

beardedeagle commented 5 years ago

@bitwalker I think that's a workable temp solution, though I'd take it a step further and allow it to accept a list of applications.