Joystream / community-repo


Distributor nodes health probe + Discord notification #804

Closed singulart closed 2 years ago

singulart commented 2 years ago

Context: Requested by Storage WG Lead AlexZNet | C.Sailors#0968 as a must-have for mainnet.

Scope: A Discord bot that pings each distributor provider's /distributor/api/v1/status endpoint. Client timeout = 250 ms (configurable). Failure to receive an HTTP 200 within that timeout triggers a notification in the #distributors channel.
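The probe described in the scope could look roughly like the sketch below. It is an illustration in Python, not code from joy-disco-bots. The names `is_healthy` and `fetch_status_code` are hypothetical, and it assumes the endpoint passed in is a base URL that does not already contain the status path.

```python
import urllib.error
import urllib.request

STATUS_PATH = "/distributor/api/v1/status"  # path taken from the issue text


def fetch_status_code(url: str, timeout_s: float) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers are raised as exceptions; report their code instead.
        return err.code


def is_healthy(endpoint: str, timeout_s: float = 0.25, fetch=fetch_status_code) -> bool:
    """True only if the status endpoint answers HTTP 200 within timeout_s.

    Assumption: endpoint is a base URL without the status path.
    """
    url = endpoint.rstrip("/") + STATUS_PATH
    try:
        return fetch(url, timeout_s) == 200
    except OSError:
        # URLError and socket timeouts are OSError subclasses: count as failure.
        return False
```

The `fetch` parameter exists only so the check can be exercised without a network. Note that the `timeout` of `urlopen` applies to individual socket operations, so it approximates the 250 ms budget rather than enforcing a hard total deadline.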

To avoid spamming the channel, the following flow is suggested when a failing node is detected:

  1. Record the failing node endpoint in a DB.
  2. Schedule a job every 15 minutes (configurable) that scans the DB records and sends one "summary" notification for all failing nodes.
  3. If a node gets back up, remove its DB record.

Example notification:

@Distribution Worker Failing node(s) alert:

  1. http://cutieblockchains.com
  2. http://cutieblockchains.com

This functionality needs to be added to the existing codebase: https://github.com/singulart/joy-disco-bots
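The record/summarize/remove flow suggested above can be sketched as follows. This is a minimal Python illustration using SQLite as the "DB". The table name `failing_nodes`, the function names, and the default role mention are assumptions made for the example, and the 15-minute scheduling and the actual Discord call are left out.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store holding one row per failing endpoint."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS failing_nodes (endpoint TEXT PRIMARY KEY)")
    return db


def record_result(db: sqlite3.Connection, endpoint: str, healthy: bool) -> None:
    """Add the endpoint on failure, remove it once the node is back up."""
    if healthy:
        db.execute("DELETE FROM failing_nodes WHERE endpoint = ?", (endpoint,))
    else:
        # INSERT OR IGNORE keeps a single row per endpoint across repeated failures.
        db.execute("INSERT OR IGNORE INTO failing_nodes (endpoint) VALUES (?)", (endpoint,))
    db.commit()


def build_summary(db: sqlite3.Connection, mention: str = "@Distribution Worker"):
    """Return one summary message for all failing nodes, or None if there are none."""
    rows = db.execute("SELECT endpoint FROM failing_nodes ORDER BY endpoint").fetchall()
    if not rows:
        return None
    lines = [f"{mention} Failing node(s) alert:", ""]
    lines += [f"  {i}. {endpoint}" for i, (endpoint,) in enumerate(rows, start=1)]
    return "\n".join(lines)
```

A scheduler would call `build_summary` at the configured interval and post the result only when it is not `None`, so a quiet network produces no messages at all.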

Estimate: 6-8h

singulart commented 2 years ago

@DzhideX @bwhm Please assign SPs

OIgnt commented 2 years ago

Proposing a method for getting the list of nodes to check: https://discord.com/channels/811216481340751934/933726271832227911/1010873958461087834

    query {
      distributionBucketOperators(where: {distributionBucket: {distributing_eq: true}}) {
        id
        metadata {
          nodeEndpoint
        }
      }
    }

This takes only the nodes that are in the distributing set, because the others are not interesting for us. The query returns everything we need: the node itself plus family:bucket-worker.
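For illustration, the proposed query could be sent and its result unpacked as in the Python sketch below. The query node URL is not given in this thread, so it is a required parameter here. The function names are hypothetical, and the response is assumed to follow the usual GraphQL shape with a top-level `data` object.

```python
import json
import urllib.request

OPERATORS_QUERY = """
query {
  distributionBucketOperators(where: {distributionBucket: {distributing_eq: true}}) {
    id
    metadata { nodeEndpoint }
  }
}
"""


def extract_endpoints(response: dict) -> dict:
    """Map operator id -> nodeEndpoint, skipping operators without an endpoint."""
    operators = (response.get("data") or {}).get("distributionBucketOperators") or []
    endpoints = {}
    for operator in operators:
        endpoint = (operator.get("metadata") or {}).get("nodeEndpoint")
        if endpoint:
            endpoints[operator["id"]] = endpoint
    return endpoints


def fetch_distributing_endpoints(query_node_url: str, timeout_s: float = 10.0) -> dict:
    """POST the query to the query node (URL supplied by the caller)."""
    body = json.dumps({"query": OPERATORS_QUERY}).encode("utf-8")
    request = urllib.request.Request(
        query_node_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return extract_endpoints(json.load(response))
```

Keeping `extract_endpoints` separate from the HTTP call means operators with missing metadata can be handled and checked without contacting a live query node.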

singulart commented 2 years ago

@DzhideX @bwhm Still waiting for a SP estimate for this

bwhm commented 2 years ago

Hey, I agree tools like these are needed, but discord doesn't seem like the right platform?

I would imagine both Workers and Leads deploy a system that alerts them by email, or some app that doesn't post notifications unless an "urgent" action is required. A Discord notification would likely get drowned out most of the time...?

OIgnt commented 2 years ago

I think we can close this ticket: we've got a tool that does such checks and excludes distributors from the active set automatically. https://discord.com/channels/811216481340751934/933726271832227911/1011700093356888224

singulart commented 2 years ago

Closing as per comment above

traumschule commented 2 years ago

To document the result, move to Done

I have read the comments on GitHub about the bot we discussed, and it looks like @bwhm does not find it a very effective tool. Maybe messaging to one particular messenger really is not the best approach, and I agree with that, but it was the first thought that came to my mind after @klaudiusz.eth raised this issue again, and I had the SP bot in mind as an example. Having thought about it for a while, I came up with my own tool that does the check automatically. It sends a request to the QN, gets all distributor nodes that are active at the moment, and fires test requests at the status endpoint of each node. If a distributor node fails this check N times (the script is run by cron), it is excluded from the distributing set automatically. This script is already running and checking nodes in my group: https://github.com/alexznet/JosytreamDistributorWatcher/blob/main/distributor-watcher.sh
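The "fails N times, then exclude" rule can be sketched as below. This is not a port of distributor-watcher.sh, which is a shell script. It is a small Python illustration that assumes the N failures must be consecutive (the thread does not say), and the `exclude` callback stands in for whatever action removes a node from the distributing set.

```python
class FailureTracker:
    """Counts consecutive failed checks per node and triggers exclusion at a threshold."""

    def __init__(self, threshold: int, exclude):
        self.threshold = threshold  # the "N" from the description
        self.exclude = exclude      # hypothetical callback taking a node id
        self.failures = {}

    def observe(self, node_id: str, healthy: bool) -> None:
        """Feed in the result of one check run (for example, one cron invocation)."""
        if healthy:
            # Assumption: a successful check resets the counter.
            self.failures.pop(node_id, None)
            return
        count = self.failures.get(node_id, 0) + 1
        self.failures[node_id] = count
        if count == self.threshold:
            # Fire exactly once, when the threshold is first reached.
            self.exclude(node_id)
```

Because a cron-driven script starts fresh on every run, a real implementation would need to persist the counters between runs, for instance in a file or a small database.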