cardano-scaling / hydra

Implementation of the Hydra Head protocol
https://hydra.family/head-protocol/
Apache License 2.0

Collect & report metrics #183

Open ch1bo opened 2 years ago

ch1bo commented 2 years ago

What & Why

To measure the success (or failure) of the Hydra Head project and improve continuously, we need to know how many Hydra Heads are opened, how long they are used, how many UTXOs are moved into or out of a Head, etc. Most of this information is publicly available and can be derived by observing the main chain. The remainder (e.g. transaction sizes and the number of UTXOs in a Head) will be collected from within the hydra-node and will be opt-out once we reach mainnet maturity.
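As a rough illustration of deriving such numbers from main-chain observations, here is a minimal sketch. It assumes some observer already yields one record per relevant on-chain event; the `Observation` type, its fields and the `"open"` / `"close"` labels are hypothetical and do not correspond to anything in hydra-node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    head_id: str  # hypothetical identifier of a head
    kind: str     # hypothetical label: "open" or "close"
    time: int     # seconds since some epoch, as observed on chain


def head_lifetimes(observations):
    """Return {head_id: seconds open} for heads seen both opening and closing."""
    opened, lifetimes = {}, {}
    for obs in sorted(observations, key=lambda o: o.time):
        if obs.kind == "open":
            # Keep the first time a head was seen opening.
            opened.setdefault(obs.head_id, obs.time)
        elif obs.kind == "close" and obs.head_id in opened:
            # Keep the first close observed after the open.
            lifetimes.setdefault(obs.head_id, obs.time - opened[obs.head_id])
    return lifetimes


def summarize(observations):
    """Aggregate counts and the mean lifetime of closed heads."""
    opened_ids = {o.head_id for o in observations if o.kind == "open"}
    lifetimes = head_lifetimes(observations)
    mean = sum(lifetimes.values()) / len(lifetimes) if lifetimes else None
    return {
        "heads_opened": len(opened_ids),
        "heads_closed": len(lifetimes),
        "mean_lifetime": mean,
    }
```

Heads that are still open simply do not contribute to the lifetime figures, which is probably the behaviour we want for a first report.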

TBD

Tasks

abailly-iohk commented 2 years ago

With a stateless "chain observer" available, we could host a simple "Hydra Head Explorer" service online that would show and track the state of heads running on some chain?

abailly-iohk commented 2 years ago

Couple of basic ideas:

abailly-iohk commented 2 years ago

I have set up and used Jaeger and Zipkin in the past, including inside Haskell apps, and having a way to track the processing of user requests across a distributed system is invaluable for understanding its behaviour.

Looking at https://github.com/ethercrow/opentelemetry-haskell, which provides support for traces. Someone pointed me at https://opentelemetry.io/docs/concepts/data-collection/, which provides a conceptual framework for all kinds of "observability" data collection. In particular, OpenTelemetry (which grew out of the merger of OpenTracing and OpenCensus) defines standards for interoperability between various kinds of services, making it possible, for example, to collect Prometheus metrics, logs and traces and export them to some other service.

We currently expose the following metrics in the node:

Handling and possibly tuning of snapshots size is important for the protocol so we should add:

Also:

Traces could be an interesting addition to analyse the trace generated by a NewTx coming from a client and how it spreads across the network until the transaction becomes confirmed. This would be particularly helpful for understanding the behaviour of the network if/when we move away from a fully connected network to something more dynamic or less densely connected, with routing between the nodes. Not sure if it's worthwhile to do it now though.
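To make the idea concrete, here is a small sketch of the kind of analysis such traces would enable. It does not use any tracing library; the `Span` record, the trace id attached to a NewTx and the `"submitted"` / `"received"` / `"confirmed"` labels are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    trace_id: str  # hypothetical id attached to a NewTx when a client submits it
    node: str      # name of the node that recorded the event
    event: str     # hypothetical labels: "submitted", "received", "confirmed"
    time_ms: int   # timestamp in milliseconds


def propagation(spans, trace_id):
    """Return (per-node receive delay, delay until the last node confirmed).

    Delays are measured from the earliest "submitted" event of the trace.
    """
    mine = [s for s in spans if s.trace_id == trace_id]
    starts = [s.time_ms for s in mine if s.event == "submitted"]
    if not starts:
        raise ValueError(f"no 'submitted' event for trace {trace_id}")
    start = min(starts)

    received = {}
    for s in mine:
        if s.event == "received":
            delay = s.time_ms - start
            # Keep the first time each node saw the transaction.
            received[s.node] = min(delay, received.get(s.node, delay))

    confirmed = [s.time_ms - start for s in mine if s.event == "confirmed"]
    return received, (max(confirmed) if confirmed else None)
```

With routing between nodes, the per-node delays would show how many hops a transaction needed, which is exactly the information that is hard to get from per-node metrics alone.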

Tasks for this feature: