IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0

Missing metric for current forging mode on a blockproducer node #5751

Open gitmachtl opened 6 months ago

gitmachtl commented 6 months ago

With the push to P2P on the BlockProducer node as well, SPOs need to change their backup/failover infrastructure.

A test with node 8.9.1 showed that after turning block production off on a node via a SIGHUP signal (with the credential files removed), all of the blockproducer's last metrics stay in place. So there is currently no way to detect the forging mode of a running node that way.

We would need an additional metric (accessible via the Prometheus interface or a CLI query) with value 1 = forging and 0 = not forging, so that we can be sure what the node is currently doing and the backup/failover infra can handle the work mode accordingly.
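To illustrate how failover tooling could consume such a metric, here is a minimal sketch that reads the 1/0 value out of Prometheus text output. The metric name `cardano_node_metrics_isForging_int` is a hypothetical placeholder (no such metric exists in 8.9.1, which is the point of this issue), and the parser only handles simple unlabelled `name value` lines.

```python
from typing import Optional

# Hypothetical metric name, used for illustration only.
FORGING_METRIC = "cardano_node_metrics_isForging_int"


def parse_forging_mode(metrics_text: str, name: str = FORGING_METRIC) -> Optional[bool]:
    """Return True for value 1, False for any other value, None if the metric is absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and Prometheus HELP/TYPE comments.
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            return float(parts[1]) == 1.0
    return None
```

Returning `None` when the metric is missing lets the caller tell "not forging" apart from "cannot tell", which is the situation described above for current nodes.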

Trying to get this data out of the logfiles is not a nice way, and I am sure we can add this metric to all the other available values.

coot commented 5 months ago

We install a SIGHUP handler in cardano-node (see) which contains the necessary information; this only requires adding an EKG counter in the node, which makes it quite easy to implement (no need to modify anything deeper in the stack, e.g. ouroboros-consensus).

gitmachtl commented 5 months ago

@coot thx, please make sure that the "isForging" EKG/Prometheus metric is also reported correctly as false (0) if the blockproducer node was started with the --non-producing-node option. That should be the default in an active/standby blockproducer backup infra: start the node up with the credentials but in non-producing mode, then check whether the conditions for promoting it to active producer are met (e.g. the node is on tip and the other blockproducers are not active). If so, reload the settings via a SIGHUP signal and start forging.
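The promotion check described above could be sketched roughly as follows. This is only an illustration under assumptions: `NodeStatus` and `should_promote` are hypothetical names, `is_forging` stands for the value of the requested metric, and how `on_tip` is determined is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class NodeStatus:
    is_forging: Optional[bool]  # None = forging metric unavailable
    on_tip: bool


def should_promote(standby: NodeStatus, others: Iterable[NodeStatus]) -> bool:
    """Decide whether a standby blockproducer may be promoted to active."""
    # The standby must be confirmed as not forging yet.
    if standby.is_forging is not False:
        return False
    # The standby must be synced to the chain tip.
    if not standby.on_tip:
        return False
    # Every other blockproducer must be confirmed as not forging.
    return all(other.is_forging is False for other in others)
```

In this sketch an unknown forging state on any other blockproducer blocks the promotion, to avoid two nodes forging at once. A real failover setup would need its own rule for a primary that is unreachable, since that is exactly the case where a standby should take over.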

coot commented 5 months ago

I don't think I will be working on it, but I'll bring it to the attention of the core team.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 120 days.