Open alexferreira opened 5 years ago
To me this looks like you have a cluster with heterogeneous OTP apps.
For Swarm to work, the OTP application that you are going to distribute processes for (e.g. with Swarm.register_name/4) needs to be available on all the nodes participating in the Swarm cluster.
@arjan this happens right after running Swarm.register_name/4; it only works after a few attempts.
So are the same OTP applications started on both nodes?
Yes, the same applications were started on both nodes.
It's working right now. However, if I stop one of the applications and start it again several times, the problem mentioned above happens.
Do you mean stopping the node or just stopping the application (Application.stop)?
Maybe the cluster is formed before all application code is loaded, and tracker requests are already coming in at that point. However, I cannot imagine that loading takes very long...
In the first gif, as you can see, I started the applications and the quoted error appeared soon after.
In the second gif, as you can see, the error does not happen.
The problem seems to be that the second node is still loading code when Swarm on the first node tells Swarm on the second node to start a process (resulting in the crash, because the code isn't loaded yet). This is happening because when running with Mix, applications and their code are loaded and started sequentially, while in a release, all application code is first loaded, then applications are started.
My guess is that Mix starts Swarm before it starts the part of the system which invokes register_name, so Swarm on the second node starts and is able to communicate with the first node and accept registration requests before the code for the registration callback is loaded. Since this is inherently racy, that's why it works only some of the time.
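To make the race concrete, here is a minimal sketch of the condition that has to hold on a node before a module/function/args callback can be invoked there. `CallbackCheck` and `callable?/3` are hypothetical names used only for illustration; they are not part of Swarm, and Swarm itself does not necessarily perform a check like this.

```elixir
defmodule CallbackCheck do
  @moduledoc false
  # Hypothetical helper for illustration only; not part of Swarm.

  # Returns true only if `mod` is loaded (or can be loaded) on this node
  # and exports `fun` with an arity matching the argument list.
  def callable?(mod, fun, args)
      when is_atom(mod) and is_atom(fun) and is_list(args) do
    Code.ensure_loaded?(mod) and function_exported?(mod, fun, length(args))
  end
end
```

On a node that is still loading application code, a check like this would return false for the registration callback, which is the window in which a remote registration request can crash.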
@arjan @beardedeagle Until we get the refactoring implemented so that Swarm can be started under the supervision tree rather than as its own tree, we could provide a configuration option which allows specifying an application that needs to be started before Swarm will start serving requests, and then basically just loop until the application status (via :application_controller.info/0) shows that it is started. Thoughts? The refactor is really the fix, but having a short term solution to this would be nice.
@bitwalker I worked around the problem by using a DynamicSupervisor.
@bitwalker I think that's a workable temp solution, though I'd take it a step further and allow it to accept a list of applications.
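A rough sketch of what such a stop-gap could look like, accepting a list of applications as suggested. Everything here is an assumption for illustration: `AwaitApps`, `await/3`, and the timeout and polling defaults are hypothetical, and the sketch uses the public `Application.started_applications/0` instead of `:application_controller.info/0`.

```elixir
defmodule AwaitApps do
  @moduledoc false
  # Hypothetical helper for illustration only; not part of Swarm.

  # Blocks until every application in `apps` is reported as started,
  # or returns an error with the still-pending apps once the timeout elapses.
  def await(apps, timeout_ms \\ 5_000, interval_ms \\ 50) when is_list(apps) do
    deadline = System.monotonic_time(:millisecond) + timeout_ms
    do_await(apps, deadline, interval_ms)
  end

  defp do_await(apps, deadline, interval_ms) do
    # started_applications/0 returns {app, description, version} tuples.
    started = for {app, _desc, _vsn} <- Application.started_applications(), do: app

    case apps -- started do
      [] ->
        :ok

      pending ->
        if System.monotonic_time(:millisecond) >= deadline do
          {:error, {:not_started, pending}}
        else
          Process.sleep(interval_ms)
          do_await(apps, deadline, interval_ms)
        end
    end
  end
end
```

For example, `AwaitApps.await([:calculator])` would block until the `:calculator` application (the name is taken from the report in this thread and assumed to be the OTP app name) shows up as started. Bounding the loop with a deadline avoids hanging forever if the application never starts.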
I have two services, calculator and investigator, running in a cluster formed with libcluster, but when I register my service, something strange sometimes happens.
For example:
When I start the calculator service, it returns the following message.
And in the investigator service I get the following return.
But if I try to start it again, it sometimes works without problems.
Can anybody help me?