bitwalker / swarm

Easy clustering, registration, and distribution of worker processes for Erlang/Elixir
MIT License

Discard pending sync request from sync_node #118

Closed · miguelfteixeira closed this 5 years ago

miguelfteixeira commented 5 years ago

At Talkdesk we've stumbled upon #91, and after an initial analysis using swarm-deadlock-repro we found that the issue occurs when a node has a pending sync request from its sync_node.

This PR proposes a solution that consists of discarding pending sync requests from the sync_node. Although I lack some context regarding the Swarm implementation, I ran some tests using this approach and it untangled the node synchronization.

Execution logs of the issue: https://gist.github.com/miguelfteixeira/1e587dadff718ad73c34011308730409

Diagram from the execution logs: lock_on_awaiting_sync_ack
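
To make the idea concrete, here's a rough sketch (illustrative only, not the actual diff in this PR; the `pending_sync_reqs` / `sync_node` names are simplified from the tracker state): before waiting for the sync ack, drop any queued sync request that came from the node we're already syncing with, so the two nodes don't end up waiting on each other.

```elixir
# Illustrative only, not the code in this PR. Assumes pending sync requests are
# tracked as a list of remote tracker pids and sync_node is the node we chose
# to sync from.
defmodule SyncPendingSketch do
  @doc "Drop pending sync requests that originated from the sync_node itself."
  def discard_from_sync_node(pending_sync_reqs, sync_node) do
    Enum.reject(pending_sync_reqs, fn pid -> node(pid) == sync_node end)
  end
end

# Usage: if node A is syncing from node B and B's tracker pid is queued in A's
# pending requests, that request is discarded instead of being replied to later.
```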

miguelfteixeira commented 5 years ago

Looks like the failing tests are already failing on the master build: https://travis-ci.com/bitwalker/swarm

miguelfteixeira commented 5 years ago

Hey @bitwalker any chance of getting feedback on this any time soon? 🙏 I'm just wondering if I'm going in the wrong direction.

Thanks 🙏

tschmittni commented 5 years ago

Thank you for this fix. We have been testing your change together with the clock fix (https://github.com/bitwalker/swarm/pull/116) and Swarm has been working very reliably since then.

I hope these two changes can be merged soon!

Thanks again!

miguelfteixeira commented 5 years ago

@tschmittni we're actually doing the same :smiley: please let me know if you find anything 🙏 I'm currently trying to debug an issue where the Swarm Tracker gets stuck after a restart (we're running in k8s).

{_, state} = :sys.get_state(Swarm.Tracker)
** (exit) exited in: :sys.get_state(Swarm.Tracker)
    ** (EXIT) time out
    (stdlib) sys.erl:315: :sys.send_system_msg/2
    (stdlib) sys.erl:116: :sys.get_state/1
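
For reference, `:sys.get_state/1` uses a 5-second default timeout, so passing an explicit timeout and peeking at the process directly helps tell a stuck tracker apart from one that's just slow (these are plain OTP calls, nothing Swarm-specific beyond the registered name):

```elixir
# :sys.get_state/2 takes an explicit timeout in milliseconds (the /1 variant
# defaults to 5000 ms), so a longer wait rules out plain slowness.
{_, state} = :sys.get_state(Swarm.Tracker, 60_000)

# If it still times out, check what the tracker process is actually doing:
Swarm.Tracker
|> Process.whereis()
|> Process.info(:current_stacktrace)
```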

Have you stumbled on this? :thinking:

arjan commented 5 years ago

This looks good to me. I'll also give this change combined with #116 a real-life test, as I'm experiencing some deadlocks myself.