anthony-gonzalez-kantox opened 2 months ago
the pool worker map Infinitomata.all(…) is empty when a deployment is done
Does it stay empty forever?
Why exactly is it necessary to get a pool worker from that map to be able to publish?
Because publishing is done through a pool, so that publishing can happen from different nodes, using different processes when the publishing queue is full.
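For context, a rough illustration (not the library's exact code): publishing effectively picks a random worker out of that map, so an empty map has nothing to pick from.

```elixir
# The worker to publish through is effectively a random pick from that map,
# so an empty map has nothing to offer:
workers = %{}                    # what the lookup returns right after a deploy
Enum.random(Map.keys(workers))   # ** (Enum.EmptyError) empty error
```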
I cannot resolve reports like “sometimes the stuff fails,” so please share logs at least.
Does it stay empty forever?
Yes
** (Enum.EmptyError) empty error
(elixir 1.16.2) lib/enum.ex:2395: Enum.random/1
(finitomata 0.25.0) lib/finitomata/pool.ex:200: Finitomata.Pool.run/3
(app 0.3.0) lib/app/finitomata/project.ex:288: MyApp.function/3
(app 0.3.0) lib/app/finitomata/project.ex:206: MyApp.function/1
(app 0.3.0) lib/app/finitomata/project.ex:186: anonymous fn/2 in MyApp.Finitomata.LivePair.on_transition/4
(elixir 1.16.2) lib/enum.ex:987: Enum."-each/2-lists^foreach/1-0-"/2
(app 0.3.0) lib/app/finitomata/project.ex:185: MyApp.Finitomata.Project.on_transition/4
(app 0.3.0) deps/finitomata/lib/finitomata.ex:1231: MyApp.Finitomata.Project.safe_on_transition/5
Well, this code should not raise anyway. I did a blind fix to retry on an empty pool with a tiny timeout. It's in main of finitomata until mox is finally upgraded.
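The idea is roughly the following; this is a sketch of the approach only, not the actual change in main, and the module name, retry count, and delay are made up.

```elixir
defmodule EmptyPoolRetry do
  @moduledoc false
  # Sketch only: retry the pool lookup a few times with a tiny sleep
  # instead of raising Enum.EmptyError right away.

  @attempts 10
  @delay_ms 10

  def random_worker(id, attempts \\ @attempts) do
    case Infinitomata.all(id) do
      workers when map_size(workers) > 0 ->
        {:ok, workers |> Map.keys() |> Enum.random()}

      _empty when attempts > 0 ->
        Process.sleep(@delay_ms)
        random_worker(id, attempts - 1)

      _empty ->
        {:error, :empty_pool}
    end
  end
end
```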
Finitomata v0.26.0 has been released, possibly closing this.
It didn't solve the problem, but I found more logs that could help narrow it down.
Task #PID<0.2250.0> started from #PID<0.2240.0> terminating
** (stop) exited in: GenServer.call(Finitomata.Rambla.Handlers.Amqp.DefinedHandler.Infinitomata.IdLookup, {:update, #Function<10.104737341/1 in Finitomata.Distributed.Supervisor.synch/2>}, 5000)
** (EXIT) an exception was raised:
** (FunctionClauseError) no function clause matching in anonymous fn/3 in Finitomata.Distributed.Supervisor.synch/2
(finitomata 0.26.0) lib/finitomata/distributed/supervisor.ex:64: anonymous fn("PoolWorker_1", %{node: :"app@10.3.0.70", pid: nil, ref: #Reference<54373.2114719762.490209281.55285>}, %{node: :"app@10.3.2.195", pid: #PID<54372.2252.0>, ref: #Reference<54372.3183318594.3711172609.118569>}) in Finitomata.Distributed.Supervisor.synch/2
(stdlib 5.2.2) maps.erl:199: :maps.merge_with_1/4
(elixir 1.16.2) lib/enum.ex:2528: Enum."-reduce/3-lists^foldl/2-0-"/3
(elixir 1.16.2) lib/agent/server.ex:23: Agent.Server.handle_call/3
(stdlib 5.2.2) gen_server.erl:1131: :gen_server.try_handle_call/4
(stdlib 5.2.2) gen_server.erl:1160: :gen_server.handle_msg/6
(stdlib 5.2.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
(elixir 1.16.2) lib/gen_server.ex:1114: GenServer.call/3
(elixir 1.16.2) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
Function: #Function<16.104737341/0 in Finitomata.Distributed.Supervisor.start_link/2>
Thanks @anthony-gonzalez-kantox!
The plot thickens: this error log looks much more like a rolling-update issue. I could not figure out how to test it, but the fix published below should address it.
Package published to https://hex.pm/packages/finitomata/0.26.1 (099a6e9fc3999e71d6ab075ae7f1a3eafb65248906507f904dc9609087140b9e)
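For reference, the failing frame above is the three-argument conflict resolver handed to the map merge; below is a minimal standalone illustration of that failure mode, with a hypothetical resolver rather than the actual code in synch/2.

```elixir
# Two views of the same pool worker during a rolling update: the stale
# entry still has pid: nil, the fresh one carries a live pid. A resolver
# whose clauses expect live pids on both sides has no matching clause,
# hence the FunctionClauseError raised from :maps.merge_with.
stale = %{"PoolWorker_1" => %{node: :"app@old", pid: nil}}
fresh = %{"PoolWorker_1" => %{node: :"app@new", pid: self()}}

resolver = fn _key, %{pid: old_pid}, %{pid: new_pid} = new_entry
              when is_pid(old_pid) and is_pid(new_pid) ->
  new_entry
end

Map.merge(stale, fresh, resolver)
#=> ** (FunctionClauseError) no function clause matching ...
```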
Thank you! Almost there, but not quite yet:
GenServer Finitomata.Rambla.Handlers.Amqp.DefinedHandler.Infinitomata.IdLookup terminating
** (FunctionClauseError) no function clause matching in anonymous fn/3 in Finitomata.Distributed.Supervisor.synch/2
(finitomata 0.26.1) lib/finitomata/distributed/supervisor.ex:64: anonymous fn("PoolWorker_1", %{node: :"app@10.3.0.7", pid: #PID<54500.2263.0>, ref: #Reference<54498.2681513529.2332819458.143000>}, %{node: :"app@10.3.3.150", pid: #PID<54498.2265.0>, ref: #Reference<0.2004287609.2871525378.160922>}) in Finitomata.Distributed.Supervisor.synch/2
(stdlib 5.2.2) maps.erl:199: :maps.merge_with_1/4
(elixir 1.16.2) lib/enum.ex:2528: Enum."-reduce/3-lists^foldl/2-0-"/3
(elixir 1.16.2) lib/agent/server.ex:23: Agent.Server.handle_call/3
(stdlib 5.2.2) gen_server.erl:1131: :gen_server.try_handle_call/4
(stdlib 5.2.2) gen_server.erl:1160: :gen_server.handle_msg/6
(stdlib 5.2.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.2273.0>): {:update, #Function<10.73131528/1 in Finitomata.Distributed.Supervisor.synch/2>}
Package published to https://hex.pm/packages/finitomata/0.26.2 (58990f2b26943fd6e1f775093456d337d6a35aa4cf8b8e29f6c0c132b9b2e39b)
It does look like you don't change the Erlang cookie between deploys, so new nodes can see the old ones.
That's correct, the cookie doesn't get renewed between deploys. I'm using AWS ECS, and if I renewed the cookie on every deploy, then whenever just one node gets replaced it would not be able to join the one already running, so I use the same cookie. As you say, new nodes can see old ones, which creates a conflict and leads to remote calls on dead nodes.
{"@hostname":"5bc7578acad1","@node":"app@ip","@timestamp":"2024-08-22T23:03:29.398","@type":"log","message":"[♻️] Distributed: [id: Finitomata.Defined.Infinitomata, node: :\"app@ip\", target: #CurrencyPair<\"EURUSD\">, error: :nodedown]","metadata":{"line":64,"pid":"#PID<0.524.0>","time":1724367809398599,"file":[108,105,98,47,105,110,102,105,110,105,116,111,109,97,116,97,46,101,120],"gl":"#PID<0.393.0>","domain":["elixir"],"application":"finitomata","mfa":"{Infinitomata, :do_distributed_call, 6}"},"severity":"error"}
{"@hostname":"5bc7578acad1","@node":"app@ip","@timestamp":"2024-08-22T23:03:29.410","@type":"log","message":"** (exit) exited in: GenServer.call(Finitomata.Defined.Infinitomata.Infinitomata.IdLookup, {:get, #Function<0.75825784/1 in Finitomata.Distributed.Supervisor.all/1>}, 5000)\n ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started\n (elixir 1.16.2) lib/gen_server.ex:1103: GenServer.call/3\n (finitomata 0.26.2) lib/finitomata/distributed/supervisor.ex:95: Finitomata.Distributed.Supervisor.all/1\n (finitomata 0.26.2) lib/finitomata/distributed/supervisor.ex:60: Finitomata.Distributed.Supervisor.synch/2\n (finitomata 0.26.2) lib/infinitomata.ex:68: Infinitomata.do_distributed_call/6\n (app 0.3.0) lib/off_broadway/broadways/ws.ex:64: OffBroadway.Broadways.Ws.handle_message/3\n (broadway 1.1.0) lib/broadway/topology/processor_stage.ex:168: anonymous fn/6 in Broadway.Topology.ProcessorStage.handle_messages/4\n (telemetry 1.2.1) /tmp/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3\n (broadway 1.1.0) lib/broadway/topology/processor_stage.ex:155: Broadway.Topology.ProcessorStage.handle_messages/4","metadata":{"line":186,"pid":"#PID<0.524.0>","time":1724367809410217,"file":[108,105,98,47,98,114,111,97,100,119,97,121,47,116,111,112,111,108,111,103,121,47,112,114,111,99,101,115,115,111,114,95,115,116,97,103,101,46,101,120],"gl":"#PID<0.393.0>","domain":["elixir"],"application":"broadway","mfa":"{Broadway.Topology.ProcessorStage, :handle_messages, 4}","crash_reason":"{{:noproc, {GenServer, :call, [Finitomata.Defined.Infinitomata.Infinitomata.IdLookup, {:get, #Function<0.75825784/1 in Finitomata.Distributed.Supervisor.all/1>}, 5000]}}, [{GenServer, :call, 3, [file: ~c\"lib/gen_server.ex\", line: 1103]}, {Finitomata.Distributed.Supervisor, :all, 1, [file: ~c\"lib/finitomata/distributed/supervisor.ex\", line: 95]}, {Finitomata.Distributed.Supervisor, :synch, 2, [file: ~c\"lib/finitomata/distributed/supervisor.ex\", line: 60]}, {Infinitomata, :do_distributed_call, 6, [file: ~c\"lib/infinitomata.ex\", line: 68]}, {OffBroadway.Broadways.Ws, :handle_message, 3, [file: ~c\"lib/off_broadway/broadways/ws.ex\", line: 64]}, {Broadway.Topology.ProcessorStage, :\"-handle_messages/4-fun-0-\", 6, [file: ~c\"lib/broadway/topology/processor_stage.ex\", line: 168]}, {:telemetry, :span, 3, [file: ~c\"/tmp/deps/telemetry/src/telemetry.erl\", line: 321]}, {Broadway.Topology.ProcessorStage, :handle_messages, 4, [file: ~c\"lib/broadway/topology/processor_stage.ex\", line: 155]}]}"},"severity":"error"}
Package published to https://hex.pm/packages/finitomata/0.26.3 (dfb8ee59dcd211cb36939def6b83270fec710f1fb0670e0d70e30c6a70a05bd7)
I made it even more defensive, although I doubt this particular fix was ever needed: the process crashes, gets restarted, and everything should be fine anyway.
if I renewed the cookie on every deploy, then whenever just one node gets replaced it would not be able to join the one already running
Eh? What would provoke the deployment of only one node? The code will fail in many different ways if nodes run different versions of the software; AWS has never been able to do hot upgrades. The proper way would be to generate a new cookie for each deploy and then deploy all the nodes. The application itself should handle on_terminate/2 callbacks and do something meaningful, or do nothing, but in any case allowing new nodes to connect to the older ones must be explicitly handled by the application.
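One way to get a per-deploy cookie is to set it at release build time; the sketch below assumes a CI-provided DEPLOY_COOKIE variable (setting RELEASE_COOKIE at runtime on every node of the same deploy would work as well).

```elixir
defmodule App.MixProject do
  use Mix.Project

  # Bake a per-deploy cookie into the release: every node of the same
  # deploy shares it, and it changes on the next deploy, so old and new
  # clusters cannot see each other. DEPLOY_COOKIE is an assumed CI variable.
  def project do
    [
      app: :app,
      version: "0.3.0",
      elixir: "~> 1.16",
      releases: [
        app: [cookie: System.get_env("DEPLOY_COOKIE", "dev-only-cookie")]
      ]
    ]
  end
end
```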
For some reason that I haven't been able to replicate locally (it only happens randomly in ECS), the pool worker map Infinitomata.all(Finitomata.Rambla.Handlers.Amqp.DefinedHandler) is empty when a deployment is done, causing publish/3 to fail. Why exactly is it necessary to get a pool worker from that map to be able to publish?