Closed miguelfteixeira closed 5 years ago
Looks like the failing tests are from the master build: https://travis-ci.com/bitwalker/swarm
Hey @bitwalker any chance of getting feedback on this any time soon? 🙏 I'm just wondering if I'm going in the wrong direction.
Thanks 🙏
Thank you for this fix. We have been testing your change together with the clock fix (https://github.com/bitwalker/swarm/pull/116) and swarm seems to work very reliably since then.
I hope these two changes can be merged soon!
Thanks again!
@tschmittni we're actually doing the same :smiley: please let me know if you find any anything 🙏 I'm currently trying to debug an issue where the Swarm Tracker get's stuck after a restart (we're running in k8s).
{_, state} = :sys.get_state(Swarm.Tracker)
** (exit) exited in: :sys.get_state(Swarm.Tracker)
** (EXIT) time out
(stdlib) sys.erl:315: :sys.send_system_msg/2
(stdlib) sys.erl:116: :sys.get_state/1
Have you stumbled on this? :thinking:
This looks good to me. I'll also give this change combined with #116 a real-life test, as I'm experiencing some deadlocks myself as well.
At Talkdesk we've stumbled at #91 and after an initial analysis using swarm-deadlock-repro we've found that the issue occurred when a node contains a pending sync request from the
sync_node
.This PR proposes a solution that consist on discarding pending sync request from the
sync_node
. Although I lack some context regarding the Swarm implementation I ran some tests using this approach and it untangled the node synchronization.Execution logs of the issue: https://gist.github.com/miguelfteixeira/1e587dadff718ad73c34011308730409
Diagram from the execution logs: