galacticcouncil / polkalert

Polkadot / Substrate based node monitoring.
Apache License 2.0

Create prometheus endpoint #62

Open ddorgan opened 4 years ago

ddorgan commented 4 years ago

Hi there! So I'm really liking polkalert and would like to use it for monitoring a bunch of hosts. But to avoid re-doing data collection / alerting rules etc... we'd like to use our standard infrastructure to scrape the data from polkalert.

So we'd normally do this via a prometheus endpoint ... the data format is just key, value with some labels. An example would be the offences monitor at: https://github.com/w3f/offences-monitor

Or, for example, our dotexporter (just for basic polkadot stats) output looks like:

dot_chain_block_number{name="parity-polkadot",version="0.7.17",chain="Kusama CC3",block="finalized"} 636323
dot_chain_block_number{name="parity-polkadot",version="0.7.17",chain="Kusama CC3",block="head"} 636324
dot_peer_count{name="parity-polkadot",version="0.7.17",chain="Kusama CC3"} 185
dot_shouldHavePeers{name="parity-polkadot",version="0.7.17",chain="Kusama CC3"} 1
dot_isSyncing{name="parity-polkadot",version="0.7.17",chain="Kusama CC3"} 0
dot_specVersion{name="parity-polkadot",version="0.7.17",chain="Kusama CC3"} 1039
dot_rpc_healthy{name="parity-polkadot",version="0.7.17",chain="Kusama CC3"} 1
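
For illustration, lines in this format can be produced with very little code. The following is a minimal sketch, not dotexporter's actual implementation: the helper names `escape_label_value` and `format_sample` are hypothetical, and it only covers the simple `name{labels} value` case shown above.

```python
def escape_label_value(value: str) -> str:
    # The Prometheus text format requires backslash, double quote and
    # newline to be escaped inside label values.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name: str, labels: dict, value) -> str:
    # Render one sample as: metric_name{label="value",...} value
    if not labels:
        return f"{name} {value}"
    label_text = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{label_text}}} {value}"


# Labels shared by every sample from this node (values taken from the output above).
node = {"name": "parity-polkadot", "version": "0.7.17", "chain": "Kusama CC3"}

lines = [
    format_sample("dot_chain_block_number", {**node, "block": "finalized"}, 636323),
    format_sample("dot_peer_count", node, 185),
]
print("\n".join(lines))
```

A scraper only needs this text served over HTTP (conventionally at `/metrics`), so any existing web server in the application can return it.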

Any chance you'd have time to work on this feature request? :-)

Many thanks, David

jak-pan commented 4 years ago

Hi @ddorgan, sorry for the delay, we were busy finishing things up. We thought about this and figured we can expose the metrics below, with more coming when we get to the next milestone. Is there anything specific you would like to see, or anything you would change, to fill the gap in the rest of the polkadot stats?

exposed metrics:

blocks produced in 24 hours
number of times chosen as validator in 24 hours
slash amount from last scrape period
slash count from last scrape period
caught offline from last scrape period
offline count from last scrape period
offline time total from last scrape period
last finalized head
last received head
average block propagation time lag from last scrape period
equivocation count from last scrape period
current stake self
current stake nominators
current number of nominators
current commission
current number of peers
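
As a rough sketch of how a few items from the list above might look once exposed, the snippet below renders them with `# HELP` and `# TYPE` lines. The metric names (`polkalert_*`), the `render` function and the choice of `gauge` for every entry are assumptions made for illustration; nothing in this thread fixes the final names or types.

```python
# Hypothetical metric names: (name, type, help text). Only a few of the
# listed metrics are shown, and none of these names are confirmed.
METRICS = [
    ("polkalert_blocks_produced_24h", "gauge", "Blocks produced in 24 hours"),
    ("polkalert_slash_count", "gauge", "Slash count from last scrape period"),
    ("polkalert_offline_count", "gauge", "Offline count from last scrape period"),
    ("polkalert_peer_count", "gauge", "Current number of peers"),
]


def render(values: dict, labels: dict) -> str:
    # Build the text exposition body for every known metric that has a value.
    label_text = ",".join(f'{key}="{val}"' for key, val in labels.items())
    out = []
    for name, kind, help_text in METRICS:
        if name not in values:
            continue  # skip metrics with no value this scrape
        out.append(f"# HELP {name} {help_text}")
        out.append(f"# TYPE {name} {kind}")
        out.append(f"{name}{{{label_text}}} {values[name]}")
    return "\n".join(out) + "\n"


text = render({"polkalert_peer_count": 185}, {"chain": "Kusama CC3"})
print(text, end="")
```

One design point worth deciding early: values reported "from last scrape period" are per-interval deltas, whereas Prometheus counters are cumulative, so either the deltas are exposed as gauges (as assumed here) or running totals are exposed as counters and the rate is computed on the Prometheus side.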

ddorgan commented 4 years ago

@jak-pan thanks so much for the reply! I think your list looks pretty great. I think specifically we're trying to grab the offline and slashing conditions because we pick up a bunch of other information from all endpoints anyway and aggregate this in a time series database. But all of those stats look super useful and would be great to scrape!