Closed by tonyvanriet 6 years ago
Hi @tonyvanriet,
Sorry for the wait. I tried your example and yes, it times out because of the heartbeat. The problem is that read_events fetches 4096 events, and it takes the process longer to handle them. Lower max_count to 128, for example, and your test will pass!
Please let me know if that works for you.
Thanks for checking on that, Milan.
It's true that this particular test doesn't fail with the smaller fetch size, but I'm not sure that helps to explain the failure. When the failure occurs, it's not a big slow read that happens to be long enough to trigger the EventStore heartbeat timeout. Both the read and write are completely locked up and there's zero activity for several seconds. If I increase the GenServer timeouts and EventStore heartbeat timeout, the deadlock will persist until the timeout occurs. Also, when this occurs for us in production, there's very light traffic that would not fill up a 4096 event fetch.
As I mentioned, we've mitigated the issue by using multiple Extreme workers. We will have to get back to this issue and find the root cause. When we do, I'll let you know what we find.
Thanks for all your help.
Fixed by #47
Here's a test that will reliably reproduce the deadlock in #45 for me. I also bumped the timeout on the Extreme.execute call to make the deadlock a bit more apparent. I can revert that if you ever end up wanting to merge the test.
I'm curious if you're able to reproduce the deadlock as well.
With the timeout change and the default config of the EventStore, this test should result in the Extreme process stopping with reason :tcp_closed. This is the EventStore closing the connection due to a heartbeat timeout, presumably because Extreme did not respond to its heartbeat ping. If you then increase the EventStore heartbeat timeout, the test will produce GenServer timeouts on the Extreme.execute call.