am-kantox / rambla

Easy publishing to many different targets
MIT License

Empty Pool when a deployment is done with a cluster of 2 nodes in ECS #15

Open anthony-gonzalez-kantox opened 2 months ago

anthony-gonzalez-kantox commented 2 months ago

For some reason I haven't been able to replicate locally (it only happens randomly in ECS), the pool worker map Infinitomata.all(Finitomata.Rambla.Handlers.Amqp.DefinedHandler) is empty when a deployment is done, causing publish/3 to fail. Why exactly is it necessary to get a pool worker from that map to be able to publish?

am-kantox commented 2 months ago

the pool worker map Infinitomata.all(…) is empty when a deployment is done

Does it stay empty forever?

Why exactly is it necessary to get a pool worker from that map to be able to publish?

Because publishing goes through a pool, so that when the publishing queue is full it can still be done from different nodes using different processes.
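
Roughly, as a simplified sketch (not the actual library code, variable names are illustrative), the publisher picks a random worker from that distributed map, which is why an empty map breaks publish/3: Enum.random/1 raises Enum.EmptyError on an empty enumerable.

```elixir
# Simplified sketch of the lookup that publish/3 relies on (not the library source).
workers = Infinitomata.all(Finitomata.Rambla.Handlers.Amqp.DefinedHandler)

case Map.keys(workers) do
  [] ->
    # nothing is registered yet (e.g. right after a deploy), so publishing cannot proceed
    {:error, :empty_pool}

  ids ->
    # a random worker is chosen so publishing is spread across nodes and processes
    {:ok, Enum.random(ids)}
end
```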


I cannot resolve reports like “sometimes the stuff fails”; please share logs at least.

anthony-gonzalez-kantox commented 2 months ago

Does it stay empty forever?

Yes

** (Enum.EmptyError) empty error
    (elixir 1.16.2) lib/enum.ex:2395: Enum.random/1
    (finitomata 0.25.0) lib/finitomata/pool.ex:200: Finitomata.Pool.run/3
    (app 0.3.0) lib/app/finitomata/project.ex:288: MyApp.function/3
    (app 0.3.0) lib/app/finitomata/project.ex:206: MyApp.function/1
    (app 0.3.0) lib/app/finitomata/project.ex:186: anonymous fn/2 in MyApp.Finitomata.LivePair.on_transition/4
    (elixir 1.16.2) lib/enum.ex:987: Enum."-each/2-lists^foreach/1-0-"/2
    (app 0.3.0) lib/app/finitomata/project.ex:185: MyApp.Finitomata.Project.on_transition/4
    (app 0.3.0) deps/finitomata/lib/finitomata.ex:1231: MyApp.Finitomata.Project.safe_on_transition/5
am-kantox commented 2 months ago

Well, this code should not raise anyway.

I did a blind fix to retry on an empty pool with a tiny timeout. It’s in the main branch of finitomata until mox is finally upgraded.
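
For reference, a hypothetical retry wrapper in the spirit of that fix (module name, attempt count, and timeout are illustrative, not the actual finitomata code):

```elixir
defmodule MyApp.PoolRetry do
  @moduledoc "Hypothetical helper: poll the distributed worker map a few times before giving up."

  @attempts 5
  @timeout 100

  def random_worker(id, attempts \\ @attempts) do
    case Infinitomata.all(id) do
      workers when map_size(workers) == 0 and attempts > 0 ->
        # empty right after a deploy: wait a tiny bit and retry
        Process.sleep(@timeout)
        random_worker(id, attempts - 1)

      workers when map_size(workers) == 0 ->
        {:error, :empty_pool}

      workers ->
        {:ok, workers |> Map.keys() |> Enum.random()}
    end
  end
end
```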

am-kantox commented 2 months ago

Finitomata v0.26.0 has been released, possibly closing this.

anthony-gonzalez-kantox commented 2 months ago

It didn't solve the problem, but I found more logs that could help narrow it down.


Task #PID<0.2250.0> started from #PID<0.2240.0> terminating
** (stop) exited in: GenServer.call(Finitomata.Rambla.Handlers.Amqp.DefinedHandler.Infinitomata.IdLookup, {:update, #Function<10.104737341/1 in Finitomata.Distributed.Supervisor.synch/2>}, 5000)
    ** (EXIT) an exception was raised:
        ** (FunctionClauseError) no function clause matching in anonymous fn/3 in Finitomata.Distributed.Supervisor.synch/2
            (finitomata 0.26.0) lib/finitomata/distributed/supervisor.ex:64: anonymous fn("PoolWorker_1", %{node: :"app@10.3.0.70", pid: nil, ref: #Reference<54373.2114719762.490209281.55285>}, %{node: :"app@10.3.2.195", pid: #PID<54372.2252.0>, ref: #Reference<54372.3183318594.3711172609.118569>}) in Finitomata.Distributed.Supervisor.synch/2
            (stdlib 5.2.2) maps.erl:199: :maps.merge_with_1/4
            (elixir 1.16.2) lib/enum.ex:2528: Enum."-reduce/3-lists^foldl/2-0-"/3
            (elixir 1.16.2) lib/agent/server.ex:23: Agent.Server.handle_call/3
            (stdlib 5.2.2) gen_server.erl:1131: :gen_server.try_handle_call/4
            (stdlib 5.2.2) gen_server.erl:1160: :gen_server.handle_msg/6
            (stdlib 5.2.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
    (elixir 1.16.2) lib/gen_server.ex:1114: GenServer.call/3
    (elixir 1.16.2) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
Function: #Function<16.104737341/0 in Finitomata.Distributed.Supervisor.start_link/2>
am-kantox commented 2 months ago

Thanks @anthony-gonzalez-kantox!

The plot thickens. This error log looks much more like a rolling-update issue. I could not figure out how to test it, but the fix should address it.
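
For context, here is a simplified, assumed illustration of this failure mode (not the actual supervisor code): :maps.merge_with/3 calls the merge fun for every key present in both maps, and during a rolling update the same worker id can carry a stale entry from an old node (pid: nil in the log above) next to a live one, so a merge fun written only for live entries has no matching clause.

```elixir
# Assumed, simplified reproduction of the FunctionClauseError above.
stale = %{"PoolWorker_1" => %{node: :"app@old", pid: nil, ref: make_ref()}}
live = %{"PoolWorker_1" => %{node: :"app@new", pid: self(), ref: make_ref()}}

merge = fn
  # only handles the case where both entries carry a live pid
  _id, %{pid: old} = _stale, %{pid: new} = fresh when is_pid(old) and is_pid(new) -> fresh
end

# Raises FunctionClauseError, as in the log, because the stale pid is nil.
:maps.merge_with(merge, stale, live)
```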

Package published to https://hex.pm/packages/finitomata/0.26.1 (099a6e9fc3999e71d6ab075ae7f1a3eafb65248906507f904dc9609087140b9e)

anthony-gonzalez-kantox commented 2 months ago

Thank you! Almost there, but not quite:


GenServer Finitomata.Rambla.Handlers.Amqp.DefinedHandler.Infinitomata.IdLookup terminating
** (FunctionClauseError) no function clause matching in anonymous fn/3 in Finitomata.Distributed.Supervisor.synch/2
    (finitomata 0.26.1) lib/finitomata/distributed/supervisor.ex:64: anonymous fn("PoolWorker_1", %{node: :"app@10.3.0.7", pid: #PID<54500.2263.0>, ref: #Reference<54498.2681513529.2332819458.143000>}, %{node: :"app@10.3.3.150", pid: #PID<54498.2265.0>, ref: #Reference<0.2004287609.2871525378.160922>}) in Finitomata.Distributed.Supervisor.synch/2
    (stdlib 5.2.2) maps.erl:199: :maps.merge_with_1/4
    (elixir 1.16.2) lib/enum.ex:2528: Enum."-reduce/3-lists^foldl/2-0-"/3
    (elixir 1.16.2) lib/agent/server.ex:23: Agent.Server.handle_call/3
    (stdlib 5.2.2) gen_server.erl:1131: :gen_server.try_handle_call/4
    (stdlib 5.2.2) gen_server.erl:1160: :gen_server.handle_msg/6
    (stdlib 5.2.2) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.2273.0>): {:update, #Function<10.73131528/1 in Finitomata.Distributed.Supervisor.synch/2>}
am-kantox commented 2 months ago

Package published to https://hex.pm/packages/finitomata/0.26.2 (58990f2b26943fd6e1f775093456d337d6a35aa4cf8b8e29f6c0c132b9b2e39b)

It does look like you don’t change the Erlang cookie between deploys, so new nodes can see old ones.

anthony-gonzalez-kantox commented 2 months ago

That's correct, the cookie doesn't get renewed between deploys. I'm using AWS ECS, and if I renewed the cookie on every deploy, a node that gets replaced on its own would not be able to join the one already running, so I use the same cookie. And, as you say, new nodes can see old ones, which creates a conflict and triggers remote calls on dead nodes.

{"@hostname":"5bc7578acad1","@node":"app@ip","@timestamp":"2024-08-22T23:03:29.398","@type":"log","message":"[♻️] Distributed: [id: Finitomata.Defined.Infinitomata, node: :\"app@ip\", target: #CurrencyPair<\"EURUSD\">, error: :nodedown]","metadata":{"line":64,"pid":"#PID<0.524.0>","time":1724367809398599,"file":[108,105,98,47,105,110,102,105,110,105,116,111,109,97,116,97,46,101,120],"gl":"#PID<0.393.0>","domain":["elixir"],"application":"finitomata","mfa":"{Infinitomata, :do_distributed_call, 6}"},"severity":"error"}
{"@hostname":"5bc7578acad1","@node":"app@ip","@timestamp":"2024-08-22T23:03:29.410","@type":"log","message":"** (exit) exited in: GenServer.call(Finitomata.Defined.Infinitomata.Infinitomata.IdLookup, {:get, #Function<0.75825784/1 in Finitomata.Distributed.Supervisor.all/1>}, 5000)\n    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started\n    (elixir 1.16.2) lib/gen_server.ex:1103: GenServer.call/3\n    (finitomata 0.26.2) lib/finitomata/distributed/supervisor.ex:95: Finitomata.Distributed.Supervisor.all/1\n    (finitomata 0.26.2) lib/finitomata/distributed/supervisor.ex:60: Finitomata.Distributed.Supervisor.synch/2\n    (finitomata 0.26.2) lib/infinitomata.ex:68: Infinitomata.do_distributed_call/6\n    (app 0.3.0) lib/off_broadway/broadways/ws.ex:64: OffBroadway.Broadways.Ws.handle_message/3\n    (broadway 1.1.0) lib/broadway/topology/processor_stage.ex:168: anonymous fn/6 in Broadway.Topology.ProcessorStage.handle_messages/4\n    (telemetry 1.2.1) /tmp/deps/telemetry/src/telemetry.erl:321: :telemetry.span/3\n    (broadway 1.1.0) lib/broadway/topology/processor_stage.ex:155: Broadway.Topology.ProcessorStage.handle_messages/4","metadata":{"line":186,"pid":"#PID<0.524.0>","time":1724367809410217,"file":[108,105,98,47,98,114,111,97,100,119,97,121,47,116,111,112,111,108,111,103,121,47,112,114,111,99,101,115,115,111,114,95,115,116,97,103,101,46,101,120],"gl":"#PID<0.393.0>","domain":["elixir"],"application":"broadway","mfa":"{Broadway.Topology.ProcessorStage, :handle_messages, 4}","crash_reason":"{{:noproc, {GenServer, :call, [Finitomata.Defined.Infinitomata.Infinitomata.IdLookup, {:get, #Function<0.75825784/1 in Finitomata.Distributed.Supervisor.all/1>}, 5000]}}, [{GenServer, :call, 3, [file: ~c\"lib/gen_server.ex\", line: 1103]}, {Finitomata.Distributed.Supervisor, :all, 1, [file: ~c\"lib/finitomata/distributed/supervisor.ex\", line: 95]}, {Finitomata.Distributed.Supervisor, :synch, 2, [file: ~c\"lib/finitomata/distributed/supervisor.ex\", line: 60]}, {Infinitomata, :do_distributed_call, 6, [file: ~c\"lib/infinitomata.ex\", line: 68]}, {OffBroadway.Broadways.Ws, :handle_message, 3, [file: ~c\"lib/off_broadway/broadways/ws.ex\", line: 64]}, {Broadway.Topology.ProcessorStage, :\"-handle_messages/4-fun-0-\", 6, [file: ~c\"lib/broadway/topology/processor_stage.ex\", line: 168]}, {:telemetry, :span, 3, [file: ~c\"/tmp/deps/telemetry/src/telemetry.erl\", line: 321]}, {Broadway.Topology.ProcessorStage, :handle_messages, 4, [file: ~c\"lib/broadway/topology/processor_stage.ex\", line: 155]}]}"},"severity":"error"}
am-kantox commented 2 months ago

Package published to https://hex.pm/packages/finitomata/0.26.3 (dfb8ee59dcd211cb36939def6b83270fec710f1fb0670e0d70e30c6a70a05bd7)

I made it even more defensive, although I doubt this particular fix was ever needed: the process crashes, gets restarted, and everything should be fine anyway.

if I renewed the cookie on every deploy, a node that gets replaced on its own would not be able to join the one already running

Eh? What would provoke a deployment that replaces only one node? The code will fail in many different ways if nodes run different versions of the software; AWS has never been able to do hot upgrades. The proper way would be to generate a new cookie for each deploy and then deploy all nodes. The application itself should handle on_terminate/2 callbacks and do something meaningful, or do nothing, but in any case allowing new nodes to connect to the older ones must be explicitly handled by the application.
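
For illustration only (module, function, and env-var names are made up, not part of finitomata or rambla), one way to get a per-deploy cookie on ECS is to derive it from a value that changes with every deployment:

```elixir
defmodule MyApp.Cluster do
  @moduledoc "Hypothetical sketch: one distribution cookie per deploy."

  # Call this after distribution has been started; with mix releases the same
  # effect is usually achieved by setting RELEASE_COOKIE per deploy instead.
  def set_deploy_cookie do
    # e.g. the ECS task definition revision exposed as an env var (name is illustrative)
    deploy_id = System.get_env("DEPLOY_ID", "dev")
    Node.set_cookie(String.to_atom("myapp_" <> deploy_id))
  end
end
```

With a per-deploy cookie, nodes from the previous deploy simply become invisible to the new cluster, so the distributed registry never tries to call workers on nodes that are about to go away.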