Context & versions
Discovered during the hydra-doom demo at Rare Evo, using both the :doom and :doom-memory-hack tagged Docker images built by @ch1bo.
Steps to reproduce
(This may not be a minimal reproduction and requires further investigation.)
Spin up 16 hydra nodes on an r5.8xlarge with 1TB of disk space
Submit between 35 and 250 transactions per second to each node for several hours.
Reach out to me if you want help reproducing the issue; I have a saved disk snapshot from the event.
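The sustained load in the steps above could be approximated with a small pacing loop. This is only a sketch: `submit_at_rate` and its `send` callback are hypothetical names, and wiring `send` to a hydra node's websocket connection is left out, since the exact client used at the event is not described here.

```python
import time
from typing import Callable, Iterable


def submit_at_rate(
    send: Callable[[object], None],
    payloads: Iterable[object],
    tx_per_second: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send(payload) for each payload, paced to tx_per_second.

    `send` is a placeholder (assumption): in a real run it would submit one
    transaction to one node. Pacing is computed against the start time, so
    a slow send() does not accumulate drift.
    """
    if tx_per_second <= 0:
        raise ValueError("tx_per_second must be positive")
    interval = 1.0 / tx_per_second
    start = clock()
    sent = 0
    for payload in payloads:
        # Wait until the scheduled send time for this payload.
        delay = (start + sent * interval) - clock()
        if delay > 0:
            sleep(delay)
        send(payload)
        sent += 1
    return sent
```

Running one such loop per node, with `tx_per_second` somewhere between 35 and 250, would roughly match the load described above.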
Actual behavior
Eventually, the hydra heads disconnect from websocket connections and refuse new connections, and the host itself becomes unresponsive to SSH. The machine must be power cycled to regain access.
Initially this was attributed to the memory leak, but it still occurred even with the hacks from #1572.
It's also unlikely to be disk space, because several of the nodes were at only 60% disk usage after rebooting.
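Because the host has to be power cycled, whatever resource ran out is not observable afterwards. One way to narrow it down might be to log periodic host snapshots to disk during a run. The sketch below assumes a Linux host; the function names and the choice of metrics (disk usage, available memory, load average, allocated file handles) are illustrative guesses, not a confirmed diagnosis.

```python
import json
import os
import shutil
import time


def read_meminfo(path="/proc/meminfo"):
    """Parse Linux /proc/meminfo into a {field: value} dict (values in kB)."""
    fields = {}
    with open(path) as f:
        for line in f:
            name, _, rest = line.partition(":")
            parts = rest.split()
            if parts and parts[0].isdigit():
                fields[name.strip()] = int(parts[0])
    return fields


def snapshot(disk_path="/", meminfo_path="/proc/meminfo",
             file_nr_path="/proc/sys/fs/file-nr"):
    """Collect one host-level resource sample."""
    usage = shutil.disk_usage(disk_path)
    mem = read_meminfo(meminfo_path)
    # The first field of file-nr is the number of allocated file handles.
    with open(file_nr_path) as f:
        allocated_handles = int(f.read().split()[0])
    return {
        "time": time.time(),
        "disk_used_pct": round(100 * usage.used / usage.total, 1),
        "mem_available_kb": mem.get("MemAvailable"),
        "load_avg_1m": os.getloadavg()[0],
        "allocated_file_handles": allocated_handles,
    }


def append_snapshot(log_path, **kwargs):
    """Append one sample as a JSON line and fsync it, so the last samples
    before a hang are more likely to survive a hard power cycle."""
    record = snapshot(**kwargs)
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())
    return record
```

Calling `append_snapshot("host-metrics.jsonl")` every few seconds from a loop or a timer would leave a trail showing which metric, if any, was trending toward exhaustion before the host stopped responding.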
Expected behavior
A hydra node should be able to operate indefinitely at high load, provided basic assumptions hold, such as sufficient disk space.
If the cause lies in the server provisioning rather than in the hydra node itself, a useful outcome of this story would be documented best practices for hosting the nodes and avoiding this scenario.