Latency in dispatching the registration of a GenServer

mszmurlo commented 3 years ago

Hi,

I'm not sure whether this is a real issue or a bug of mine because of a misunderstanding on how Horde works. I posted that question on Stack Overflow; the question got upvoted a couple of times but I got no answer so I believe I'm not the only one who misunderstood something.

Basically, I would have thought that calling Horde.DynamicSupervisor.start_child(...) was synchronous, that is it returns only once all dispatching on all nodes of the cluster was done, as it is supposed to have the same behavior as the standard DynamicSupervisor. However, I've observed the following:

If I run the application on some nodes with empty registries (actually 2 nodes are sufficient to show the issue)
If I call Horde.DynamicSupervisor.start_child(myGenServer) on node N1 and immediately after I call an API on that GenServer
And is that GenServer had been physically started on the other node, N2,

Then

Horde.Registry.lookup(via_tuple(id)) doesn't find it before about 250ms on average. My timing measures a range between 150ms to 350, but of course this depends on the system. The GenServer becomes available in the registry after.

If Horde.DynamicSupervisor.start_child(...) is asynchronous, that's annoying as we don't know when it will return so I ended by implementing a function get_pid() which actually waits until the GenServer is available (but I consider this as quite dirty)

def get_pid(_id, _wait, waited, max_wait) when waited > max_wait, do: :error

def get_pid(id, wait, waited, max_wait) do
  Logger.debug("get_pid(#{id}, wait=#{wait}, waited=#{waited}, max_wait=#{max_wait})")
  case Horde.Registry.lookup(via_tuple(id)) do  
    [{pid, _}] ->
      Logger.debug("\t-> got pid=#{inspect pid}")
      pid
    [] ->
      Logger.debug("\t-> got []. Going for a nap for #{wait}ms")
      Process.sleep(wait)
      get_pid(id, wait*2, waited+wait, max_wait)
  end
end

Here comes an example of what I get on the log when I call a simple ping function:

[debug] UserAgent.ping(rid="60321c53d6ca3fac318b2da9")
[debug] get_pid(60321c53d6ca3fac318b2da9, wait=10, waited=0, max_wait=5000)
[debug]        -> got []. Going for a nap for 10ms
[debug] get_pid(60321c53d6ca3fac318b2da9, wait=20, waited=10, max_wait=5000)
[debug]        -> got []. Going for a nap for 20ms
[debug] get_pid(60321c53d6ca3fac318b2da9, wait=40, waited=30, max_wait=5000)
[debug]        -> got []. Going for a nap for 40ms
[debug] get_pid(60321c53d6ca3fac318b2da9, wait=80, waited=70, max_wait=5000)
[debug]        -> got []. Going for a nap for 80ms
[debug] get_pid(60321c53d6ca3fac318b2da9, wait=160, waited=150, max_wait=5000)
[debug]        -> got []. Going for a nap for 160ms
[debug] get_pid(60321c53d6ca3fac318b2da9, wait=320, waited=310, max_wait=5000)
[debug]        -> got pid=#PID<25303.596.0>

which shows the client had been waiting 310ms to get the pid. That's a long latenecy, right ?

If the call to start_child() is supposed to be synchronous, well, either horde has a bug as it returns before all dispatch had been done or, well, I have config error or a bug but I've spent a week on this without being able to spot it.

Just to give you an idea of what I'm doing:

The application is an HTTP backend where each connected user it mapped to a UserAgent, UA, (implemented as a GenServer).
When an HTTP API is called, I check whether there is an UA with the corresponding session token. This token encrypts the key used in the registry. If so, the user is connected
The problem arises in the login() controller where I create the UA and need to update it's state:

    ## Call start_child to create the UA
    res = Horde.DynamicSupervisor.start_child(MyApi.DSup, {MyApi.UserAgent, user})     
    case res do
      {:ok, _ua_pid} ->
    stok = Utils.mk_session_token(user)
    rid = Utils.mk_registry_key(user)
    #UserAgent.ping(rid)

         ## Updating the state with the session token but Boom
    UserAgent.add_token(rid, stok)

    resp = %{"ret" => "C0", "tok" => stok}
    json(conn, resp)

      {:error, {:already_started, _ua_pid}} ->
      # Handle the case where the user is already connected

Cheers

RealVidy commented 3 years ago

I just got bitten by the same issue.

I couldn't find a genserver I had just started when it was on a different node as the caller. I had no issue if the caller and the callee were on the same node, or if the callee was a long running process started a while ago. The replication latency explains it, and I was able to confirm that this is the issue by simply waiting a bit after starting the remote genserver.

RealVidy commented 3 years ago

Actually, since we get the pid of a child after starting it, we can manually register its name and it solves our issue:

case Horde.DynamicSupervisor.start_child(
       MyHordeSupervisor,
       {__MODULE__, my_args}
     ) do
  {:ok, pid} ->
    Horde.Registry.register_name(
      {MyHordeRegistry, my_name},
      pid
    )

    :ok
  {:error, error} ->
    {:error, error}
end

derekkraan commented 3 years ago

@mszmurlo This is how Horde works, which is documented here: https://hexdocs.pm/horde/eventual_consistency.html#content

mszmurlo commented 3 years ago

@mszmurlo This is how Horde works, which is documented here: https://hexdocs.pm/horde/eventual_consistency.html#content

@derekkraan Right. It's just that the delay seems very important. It would be very useful at least to have a choice between "synchronous" registration that exits only once the information had been dispatched on all nodes and the current version where we can't be sure that all nodes are aware of the info (but I believe most of the time that's not important)

derekkraan commented 3 years ago

@mszmurlo I understand that you might find this useful, but Horde is not built this way, and I have no plans at this moment to support this feature. I also don't think it's quite as useful as you think it is, as most problems that people experience are related to new requests coming in while the information is still propagating across the cluster, and race conditions associated with that (can not find process, but when starting process, all of a sudden the information has propagated and now you get an error because the process can be found).

I will also note that you can adjust the sync_interval by passing the :delta_crdt_options option, documented here: https://hexdocs.pm/horde/Horde.DynamicSupervisor.html#t:option/0, if 250ms or whatever the default is, is not fast enough for you.

derekkraan / horde

Latency in dispatching the registration of a GenServer #223