cardano-scaling / hydra

Implementation of the Hydra Head protocol
https://hydra.family/head-protocol/
Apache License 2.0

Event Log Rotation and Memory Growth #1581

Open Quantumplation opened 2 months ago

Quantumplation commented 2 months ago

Why

While working on the hydra-doom project, we noticed that both the on-disk state and the in-memory state grew without bound (see #1572).

This meant that, at the sustained load the hydra-doom demo was producing, nodes became inoperable after just a few hours. The hack in #1572 helped, but the on-disk state still had to be rotated regularly, by hand.

Rotating by hand meant stopping the nodes, renaming the data directory, bringing the nodes back up, and then shipping the old data directory off to archival storage. This only worked because we were using offline nodes and didn't mind interrupting the head.

What

I'd like to propose that the hydra head implement checkpointing for the event log.

How

This is just a proposed implementation; feel free to adapt it to better fit the intricacies of the hydra codebase.

This would allow a third-party agent to detect the checkpoint and trigger any archival, backup, or cleanup that was needed, without interrupting the hydra head. Hydra heads would also be able to recover faster after a failure, and memory usage would stay within a bounded limit.
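To make the checkpointing idea a bit more concrete, here is a minimal sketch in Python. It is not based on the actual hydra-node persistence code, and everything in it is an assumption made for illustration: the `CheckpointedLog` class, the toy `apply_event` reducer standing in for the real head-state transition, the `checkpoint.json` and `events-N.jsonl` file names, and the `checkpoint_every` threshold. The point is only to show that once a checkpoint is written, recovery needs the checkpoint plus the current log segment, so older segments can be archived by something outside the node.

```python
import json
import os
from pathlib import Path


def apply_event(state, event):
    """Toy reducer standing in for the real head-state transition (hypothetical)."""
    return {"count": state["count"] + 1, "last": event}


class CheckpointedLog:
    """Append-only event log split into segments, with periodic checkpoints."""

    def __init__(self, directory, checkpoint_every=3):
        self.dir = Path(directory)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.checkpoint_every = checkpoint_every  # assumed, arbitrary threshold
        self.segment = 0   # index of the segment currently being written
        self.pending = 0   # events appended since the last checkpoint
        self.state = {"count": 0, "last": None}
        self._recover()

    def _segment_path(self, n):
        return self.dir / f"events-{n}.jsonl"

    def _recover(self):
        # Load the latest checkpoint, then replay only the current segment.
        checkpoint = self.dir / "checkpoint.json"
        if checkpoint.exists():
            saved = json.loads(checkpoint.read_text())
            self.segment, self.state = saved["segment"], saved["state"]
        current = self._segment_path(self.segment)
        if current.exists():
            with current.open() as lines:
                for line in lines:
                    self.state = apply_event(self.state, json.loads(line))
                    self.pending += 1

    def append(self, event):
        # Persist the event first, then update the in-memory state.
        with self._segment_path(self.segment).open("a") as out:
            out.write(json.dumps(event) + "\n")
        self.state = apply_event(self.state, event)
        self.pending += 1
        if self.pending >= self.checkpoint_every:
            self.checkpoint()

    def checkpoint(self):
        # Write the snapshot to a temp file and rename it into place, so a
        # reader never sees a half-written checkpoint.json.
        next_segment = self.segment + 1
        tmp = self.dir / "checkpoint.json.tmp"
        tmp.write_text(json.dumps({"segment": next_segment, "state": self.state}))
        os.replace(tmp, self.dir / "checkpoint.json")
        self.segment, self.pending = next_segment, 0

    def archivable_segments(self):
        # Segments older than the current one are no longer needed for recovery.
        def index(path):
            return int(path.stem.split("-")[1])
        done = [p for p in self.dir.glob("events-*.jsonl") if index(p) < self.segment]
        return sorted(done, key=index)
```

In this sketch an external agent could watch `checkpoint.json` for changes (or poll `archivable_segments()`) and move the returned files to archival storage while the node keeps running. Restart time is bounded by `checkpoint_every` events rather than by the full history.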

Again, I'm super unfamiliar with the hydra codebase, so there might be more subtleties that are needed, but I just wanted to get the ball rolling on a discussion :)

ch1bo commented 2 months ago

As it was only mentioned in passing in this item, we might want to scope separate item(s) about the memory growth in:

ch1bo commented 1 month ago

Created https://github.com/cardano-scaling/hydra/issues/1618 to cover the API server part of tackling memory growth.