Monitoring metrics for cybernode

abitrolly commented 6 years ago

Related issues:

Story A - healthcheck for cybernode

As a user, I want to see that my cybernode is healthy, and if it is not, then see the reason why.

Story B - cybernode monitoring

As a developer/contributor, I want to see how cybernode works. I want to see what it is doing, whether there are any bottlenecks or anomalies, and whether the node is synchronized with other nodes.

Design Considerations

For people who want a simple status, looking at a page with all the bells and whistles is not fun. It is possible to design a fluid SVG interface (Lottie?) that contains the whole cybernode blueprint with all moving components and colors each component according to its status: a SCADA system on Lottie.

But before we get there, we may use Prometheus+Grafana for all sorts of required info, and we should hide advanced options.

Data and processes

List of processes that cybernode is doing:

chain indexation
querying exchanges
calculating tickers
calculating block/tx/... analytics

Story C - cybernode sanity

As a "business" owner, I want to be absolutely sure that cybernode is sane and is giving the latest available information to make "business" decisions. That includes stats such as whether an expected block did not arrive in time or whether something is wrong with the network, and this should be visible on the main page.

Story D - cybernode Tamagotchi

cybernode is likely not the only process running on the system, so it would be nice to see how much it "costs" to run certain components. Before we can tell that, we need to collect those stats.

abitrolly commented 6 years ago

Stats per component: how much CPU it consumes, how much memory it uses, and whether memory usage grows over time. Also, what achievements (CPU, mem, hard drive) do I need to get to add new components (features/abilities)?
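To answer the "does memory usage grow over time?" question, one option is to fit a trend line through periodic memory samples. The sketch below is only an illustration: the function name and the `(unix_time, bytes)` sample format are assumptions, and how the samples are collected per component is left open.

```python
def memory_growth_per_hour(samples):
    """Least-squares slope of (unix_time, bytes) samples, in bytes per hour.

    `samples` is a hypothetical list of (timestamp_seconds, memory_bytes)
    pairs collected for one component.
    """
    n = len(samples)
    if n < 2:
        return 0.0  # not enough data to estimate a trend
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        return 0.0  # all samples share one timestamp
    return cov / var * 3600  # slope is per second, convert to per hour


# Three samples one hour apart, each 100 bytes larger than the last.
growth = memory_growth_per_hour([(0, 100), (3600, 200), (7200, 300)])
```

A slope that stays positive over a long window would be a hint of a leak; a single spike would not move it much.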

Story E - backend ops

As an application developer, I want to know ~~how much time user spends in app~~ about backend errors that occur for specific user requests. I want to get events when something has failed or crashed. I need to know if the node went down and when it was down, and whether people made changes in the client app while it was down.

For example, I track some user-level events only while the backend is working. When the backend is down, a tx comes in, but we don't catch it and can't report it to the app. When the connection is restored, we resync, and we may miss the event where a tx first enters the mempool AND THEN a block; we only see the tx in the block.
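A rough way to reconstruct "when was the node down" is to look for gaps in a heartbeat log and then check which client-side events fall inside those gaps. This is a sketch under assumptions: the heartbeat list, the `max_gap` threshold, and the `(timestamp, name)` event format are all hypothetical and not part of cybernode.

```python
def downtime_windows(heartbeats, max_gap):
    """Return (start, end) pairs where consecutive heartbeats are
    more than `max_gap` seconds apart."""
    ordered = sorted(heartbeats)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap]


def events_during_downtime(events, windows):
    """Return the (timestamp, name) events that fall strictly inside
    any downtime window."""
    return [
        event
        for event in events
        if any(start < event[0] < end for start, end in windows)
    ]


# Heartbeats every 10 s with one long gap between t=20 and t=80.
windows = downtime_windows([0, 10, 20, 80, 90], max_gap=15)
missed = events_during_downtime([(5, "login"), (50, "tx"), (85, "logout")], windows)
```

Events reported in `missed` are the ones the backend could only have seen after resync, so they are candidates for the "tx seen only in block" case described above.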

As a developer, I also want to trace the speed of requests and of the various components that add to the final lag, such as the speed of DB access.

abitrolly commented 6 years ago

Metrics

Indexation process.

[ ] index_delay - time from when we received the block until we process it
[ ] index_queue - number of blocks that are waiting to be processed
[ ] block_speed - critical: we should receive a block every 10 minutes (bitcoin), so this is a graph of how many minutes (bitcoin) or seconds (ethereum) have passed since the last block, with hard alerts if it goes beyond 10 minutes and 1 minute respectively
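The three metrics above could be derived from a small amount of per-chain bookkeeping. The following is a minimal sketch, not the actual cybernode code: the `IndexerState` class and its fields are hypothetical, and only the alert thresholds (10 minutes for bitcoin, 1 minute for ethereum) come from the list above.

```python
from dataclasses import dataclass, field

# Hard alert thresholds in seconds, as stated in the metric description.
BLOCK_ALERT_SECONDS = {"bitcoin": 10 * 60, "ethereum": 60}


@dataclass
class IndexerState:
    """Hypothetical bookkeeping for one chain's indexer."""
    chain: str
    last_block_received_at: float  # unix time the newest block arrived
    pending: dict = field(default_factory=dict)  # height -> unix time received

    def index_queue(self) -> int:
        # Number of blocks waiting to be processed.
        return len(self.pending)

    def index_delay(self, height: int, processed_at: float) -> float:
        # Seconds from receiving the block until it was processed.
        return processed_at - self.pending[height]

    def block_speed(self, now: float) -> float:
        # Seconds elapsed since the last block was received.
        return now - self.last_block_received_at

    def block_alert(self, now: float) -> bool:
        # True once the gap exceeds the chain's hard threshold.
        return self.block_speed(now) > BLOCK_ALERT_SECONDS[self.chain]


state = IndexerState("bitcoin", last_block_received_at=1000.0,
                     pending={5: 990.0, 6: 1000.0})
late = state.block_alert(now=1601.0)  # more than 10 minutes since last block
```

These values could then be exported as gauges to whatever backend is used (Prometheus is mentioned earlier in the thread), with the alert rule living either here or in the monitoring system.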

abitrolly commented 6 years ago

[ ] data_per_block - critical: how much data we get from one block; this includes storage size for block data and size increments for all indexes that are updated as a result.
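One simple way to get data_per_block is to snapshot the size of each store before and after indexing a block and take the difference. This is a sketch under that assumption; the store names below are made up for illustration, and how sizes are actually read from the storage backend is not covered.

```python
def data_per_block(sizes_before: dict, sizes_after: dict) -> dict:
    """Return bytes added per store by indexing one block, plus a total.

    Both arguments map a (hypothetical) store or index name to its size
    in bytes. A store that first appears after the block counts from 0.
    """
    growth = {
        name: sizes_after[name] - sizes_before.get(name, 0)
        for name in sizes_after
    }
    growth["total"] = sum(growth.values())
    return growth


growth = data_per_block(
    {"blocks": 100, "tx_index": 40},
    {"blocks": 160, "tx_index": 55, "address_index": 10},
)
```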

abitrolly commented 6 years ago

[ ] index_time - time for processing one block, from getting the API request to process the block until processing is complete (this could be extracted from Zipkin)
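If Zipkin traces are not available, index_time could also be measured in-process. The sketch below uses a context manager around the block-processing call; the `timed_block` helper and the `index_times` dictionary are hypothetical names, and this is an alternative to the Zipkin extraction, not a description of it.

```python
import time
from contextlib import contextmanager

index_times = {}  # block height -> seconds spent processing


@contextmanager
def timed_block(height, clock=time.perf_counter):
    """Record how long the wrapped code took for the given block height.

    `clock` is injectable so the timing can be tested deterministically.
    """
    start = clock()
    try:
        yield
    finally:
        # Recorded even if processing raised, so failures are timed too.
        index_times[height] = clock() - start


with timed_block(500000):
    pass  # the real block-processing call would go here
```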

mastercyb commented 6 years ago

Block height of every blockchain node is the most needed thing to start from
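As a starting point, block height per chain could be exposed as a gauge in the Prometheus text exposition format. This is a minimal sketch: the metric name `cybernode_block_height` and the `chain` label are assumptions, and a real exporter would more likely use a Prometheus client library than hand-rendered text.

```python
def render_block_heights(heights: dict) -> str:
    """Render {chain: height} as Prometheus text exposition format."""
    lines = [
        "# HELP cybernode_block_height Latest block height seen per chain.",
        "# TYPE cybernode_block_height gauge",
    ]
    for chain in sorted(heights):
        lines.append(f'cybernode_block_height{{chain="{chain}"}} {heights[chain]}')
    return "\n".join(lines) + "\n"


# Heights here are placeholder values for illustration only.
text = render_block_heights({"ethereum": 5, "bitcoin": 3})
```

Comparing the exported height against a reference node's height would then show whether the node is in sync.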

hleb-albau commented 6 years ago

@abitrolly Could you please check our latest monitoring service: http://monitoring.cybersearch.io/d/94l_L2Nmz/elassandra-monitoring?refresh=1m&orgId=1 http://monitoring.cybersearch.io/dashboards

hleb-albau commented 6 years ago

not active, closed.

bro-n-bro / cybernode