Chainflow / oasis-mission-control

4 - Collect feedback on dashboard prototype #3

Open cfl0ws opened 3 years ago

cfl0ws commented 3 years ago

Oasis Mission Control Call for Feedback

Chainflow and our development partner Vitwit have been awarded an Oasis grant to build the Oasis Mission Control Validator Monitoring and Alerting Dashboard. You can find more details about that here.

We're excited to share this prototype with the community. Validators, we're building this for you.

Please review the work done so far and provide feedback. We'll use this feedback to update the prototype and deliver a final, open-source version for validators to use.

For example -

1 - Is the dashboard missing any key metrics?

2 - Are there any additional alerts you'd like to see made available?

3 - Is there anything we can do to organize the information in a more user-friendly way, e.g. reorganize existing dashboards and/or create new ones?

Please provide your feedback in the comments of this issue.

Here's a brief overview of the dashboards and current alerts.

Summary Dashboard

This view provides a quick-look at overall validator and system health.

[Screenshots: Summary Dashboard, two images, 2020-10-27]

Validator Monitoring Dashboard

This view provides a comprehensive look at validator details and performance, expanding on the summary dashboard. It will also include proposal information once Oasis implements a Governance module.

Note: The system displays the total number of peers. For those who choose to implement a sentry node configuration, we will implement a metric that shows the peer names as well.

This helps an operator confirm that the validator is connected to the peers they expect. In this scenario, an alert will also be configured to notify the user if the number of peers drops below a specified number.

For example, if your validator is connected to two sentries, the system will alert you if the number of peers drops below two.
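The threshold rule in this example can be sketched in a few lines of Python. This is only an illustration of the logic, not the dashboard's actual alerting code; the names `PeerAlertRule` and `check_peer_count` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerAlertRule:
    # Hypothetical setting: the minimum number of peers the operator expects,
    # e.g. 2 when the validator sits behind two sentry nodes.
    min_peers: int


def check_peer_count(current_peers: int, rule: PeerAlertRule) -> Optional[str]:
    """Return an alert message if the peer count is below the minimum, else None."""
    if current_peers < rule.min_peers:
        return (
            f"ALERT: peer count {current_peers} is below "
            f"the configured minimum of {rule.min_peers}"
        )
    return None


if __name__ == "__main__":
    rule = PeerAlertRule(min_peers=2)
    for peers in (2, 1):
        print(peers, check_peer_count(peers, rule))
```

With a minimum of two, a count of two raises no alert and a count of one does.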

[Screenshots: Validator Monitoring Dashboard, two images, 2020-10-27]

System Monitoring Dashboard

This view provides a comprehensive look at system performance metrics, expanding on the summary dashboard. Here you'll find all the system metrics you'd expect to see in a comprehensive system monitoring tool.

[Screenshots: System Monitoring Dashboard, nine images, 2020-10-27]

Alerting

So far, these alerts are configured -

This image shows some of those alerts in action.

[Screenshot: alerts in action, 2020-09-07]

cfl0ws commented 3 years ago

As no feedback was received in this round of collection, we are happy to do another round of updates after the community has had a chance to use the tool.

joesixpack commented 3 years ago

FYI, the dashboards don't completely work. Example...

image

cfl0ws commented 3 years ago

@joesixpack apologies for the delayed response. Can you please provide additional context?

I'm assuming this is a screenshot of an implementation you attempted? If so, what steps did you follow?

cc: @PrathyushaLakkireddy

PrathyushaLakkireddy commented 3 years ago

@joesixpack, some metrics are read from Prometheus and others are fetched from the network URL you have configured. To get the Prometheus metrics working, the Oasis node has to be started with these flags: --metrics.mode pull --metrics.address <listen-address>:3000

Also, could you check whether the configured network URL is working?

joesixpack commented 3 years ago

For network URL I'm using "http://157.230.100.229:3000" which is your server.

Oasis config.toml has:

metrics:
  mode: pull
  address: 0.0.0.0:9999

Port 3000 is not available to use as that is what Grafana uses for its web dashboard.
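For anyone hitting the same conflict, a minimal Python sketch for checking whether something is already listening on a port is shown here. The helper `port_in_use` is hypothetical and not part of Mission Control; ports 3000 and 9999 are simply the ones mentioned in this comment, and the check only tells you that a listener exists, not which program owns it.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return s.connect_ex((host, port)) == 0


if __name__ == "__main__":
    # 3000: Grafana's web dashboard; 9999: the metrics address from the config above.
    for port in (3000, 9999):
        print(port, "in use" if port_in_use(port) else "free")
```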

Is the network url actually supposed to point to my own node's metric address? That is not stated in the docs. That makes some kind of sense and I tried that and port 3001 also, but the dashboard errors (Bad Gateway) didn't resolve.

Regardless, I have already run into that edge-case bug twice. Since I can't upgrade to 20.12.3 yet (I did so accidentally, and it worked fine before I reverted), I'll have to shut down Mission Control to prevent another crash.

joesixpack commented 3 years ago

There's also what looks like missing and/or wrongly named datasources in some of the dashboards.

PrathyushaLakkireddy commented 3 years ago

Sorry about that issue, @joesixpack. If you have another network URL, you can use that; otherwise you can keep the one we have provided. I will update the Grafana dashboards to resolve the Bad Gateway errors.

joesixpack commented 3 years ago

I'm seeing this in the log:

2021/01/05 01:26:58 Error while unmarshelling the validator set data proto: wrong wireType = 0 for field Ed25519
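As background on what "wireType = 0" means: in the protobuf wire format, every field starts with a varint key equal to (field_number << 3) | wire_type, where wire type 0 is a varint and wire type 2 is a length-delimited value such as bytes. The sketch below decodes that key. It is a generic illustration with a hypothetical helper name, not Mission Control code, and it makes no claim about the actual cause of this error.

```python
from typing import Tuple

# Wire type numbers as defined by the protobuf encoding.
WIRE_TYPES = {0: "VARINT", 1: "I64", 2: "LEN", 3: "SGROUP", 4: "EGROUP", 5: "I32"}


def decode_tag(data: bytes) -> Tuple[int, int, int]:
    """Decode the leading field key of a protobuf message.

    Returns (field_number, wire_type, bytes_consumed).
    """
    key = 0
    shift = 0
    for i, byte in enumerate(data):
        key |= (byte & 0x7F) << shift  # low 7 bits carry payload
        if not byte & 0x80:            # high bit clear: last byte of the varint
            return key >> 3, key & 0x07, i + 1
        shift += 7
    raise ValueError("truncated varint")


if __name__ == "__main__":
    for raw in (b"\x0a", b"\x08"):
        field, wire_type, _ = decode_tag(raw)
        print(raw.hex(), "field", field, "wire type", WIRE_TYPES[wire_type])
```

A decoder that expects a length-delimited field (wire type 2) but reads a key with wire type 0 will typically reject the message, which is one common symptom of the sender and receiver using different message definitions.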

cfl0ws commented 3 years ago

@PrathyushaLakkireddy please take a look 👆

@joesixpack note we currently recommend ONLY running Oasis Mission Control with v20.12.3. This is due to a bug in the Oasis code that was fixed in v20.12.3.

See details here.

Note that the chances of the bug crashing the validator when running Mission Control are very low. We ran Chainflow's instance without a problem for a couple months, then the bug nailed us. It's for this reason we're suggesting to stay on the safe side and wait until you're running v20.12.3 on mainnet.

PrathyushaLakkireddy commented 3 years ago

> I'm seeing this in the log:
>
> 2021/01/05 01:26:58 Error while unmarshelling the validator set data proto: wrong wireType = 0 for field Ed25519

Fixed.

joesixpack commented 3 years ago

Could you upload the dashboards to Grafana and provide the #'s to import?