exponentially / extreme

Elixir Adapter for EventStore
MIT License

Add concurrent read/write stress test #46

Closed tonyvanriet closed 6 years ago

tonyvanriet commented 7 years ago

Here's a test that will reliably reproduce the deadlock in #45 for me. I also bumped the timeout on the Extreme.execute call to make the deadlock a bit more apparent. I can revert that if you ever end up wanting to merge the test.

I'm curious if you're able to reproduce the deadlock as well.

With the timeout change and the default config of the EventStore, this test should result in the Extreme process stopping with reason :tcp_closed. This is the EventStore closing the connection due to a heartbeat timeout, presumably because Extreme did not respond to its heartbeat ping. If you then increase the EventStore heartbeat timeout, the test will produce GenServer timeouts on the Extreme.execute call.

# /etc/eventstore/eventstore.conf
ExtTcpHeartbeatTimeout: 30000
ExtTcpHeartbeatInterval: 60000

burmajam commented 7 years ago

Hi @tonyvanriet,

Sorry for the delay. I tried your example and yes, it times out because of the heartbeat. The problem is that read_events fetches 4096 events at a time, and it takes the process longer to work through that many. Lower max_count to 128, for example, and your test will pass!

Please let me know if that works for you.
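
For illustration, here is a minimal sketch of reading a stream in pages of 128 events instead of one 4096-event fetch. The `PagedRead` module and the `read_page` function are hypothetical and not part of Extreme. `read_page` stands in for whatever performs the actual read (for example a wrapper around `Extreme.execute`), and it is assumed to return `{:ok, events, end_of_stream?}`. An in-memory list plays the role of the stream so the sketch runs without an EventStore.

```elixir
defmodule PagedRead do
  # Hypothetical helper, not part of Extreme.
  # `read_page` is a 2-arity function (from_event_number, max_count) that is
  # assumed to return {:ok, events, end_of_stream?} or {:error, reason}.
  def read_all(read_page, max_count \\ 128) do
    do_read(read_page, 0, max_count, [])
  end

  defp do_read(read_page, from, max_count, acc) do
    case read_page.(from, max_count) do
      # Last page: return everything collected so far.
      {:ok, events, true} ->
        {:ok, acc ++ events}

      # More events remain: continue from the next event number.
      {:ok, events, false} ->
        do_read(read_page, from + length(events), max_count, acc ++ events)

      {:error, _reason} = error ->
        error
    end
  end
end

# In-memory stand-in for a stream holding 300 events.
stream = Enum.to_list(0..299)

fake_read = fn from, max_count ->
  page = Enum.slice(stream, from, max_count)
  {:ok, page, from + length(page) >= length(stream)}
end

{:ok, events} = PagedRead.read_all(fake_read, 128)
IO.inspect(length(events))
```

With a page size of 128, the 300 events arrive in three reads, so each individual read stays small.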

tonyvanriet commented 6 years ago

Thanks for checking on that, Milan.

It's true that this particular test doesn't fail with the smaller fetch size, but I'm not sure that helps to explain the failure. When the failure occurs, it's not a big slow read that happens to be long enough to trigger the EventStore heartbeat timeout. Both the read and write are completely locked up and there's zero activity for several seconds. If I increase the GenServer timeouts and EventStore heartbeat timeout, the deadlock will persist until the timeout occurs. Also, when this occurs for us in production, there's very light traffic that would not fill up a 4096 event fetch.
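
As a generic illustration of the GenServer timeout symptom (not a reproduction of the Extreme deadlock itself), the sketch below shows what a caller sees when a GenServer does not reply within the call timeout. `BusyServer` is a hypothetical module that simply sleeps inside `handle_call`; nothing here uses Extreme or EventStore.

```elixir
defmodule BusyServer do
  use GenServer

  def init(state), do: {:ok, state}

  # Simulates a server that is stuck for `ms` milliseconds before replying.
  def handle_call({:work, ms}, _from, state) do
    Process.sleep(ms)
    {:reply, :done, state}
  end
end

{:ok, pid} = GenServer.start(BusyServer, nil)

# The server needs 500 ms but the caller only waits 100 ms,
# so GenServer.call/3 exits in the caller with a :timeout reason.
result =
  try do
    GenServer.call(pid, {:work, 500}, 100)
  catch
    :exit, {:timeout, _details} -> :caller_timed_out
  end

IO.inspect(result)

# A longer timeout lets the next call complete once the server is free again.
later = GenServer.call(pid, {:work, 50}, 1_000)
IO.inspect(later)
```

Raising the timeout only changes how long the caller waits. If the server never becomes free, as in the lockup described above, the call still fails once the larger timeout expires.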

As I mentioned, we've mitigated the issue by using multiple Extreme workers. We will have to get back to this issue and find the root cause. When we do, I'll let you know what we find.

Thanks for all your help.

mjaric commented 6 years ago

Pull request #47 should fix this issue.

burmajam commented 6 years ago

Fixed by #47