Closed singulart closed 2 years ago
@DzhideX @bwhm Please assign SPs
Proposing method for getting a list of nodes for checking - https://discord.com/channels/811216481340751934/933726271832227911/1010873958461087834
query{ distributionBucketOperators(where: {distributionBucket: {distributing_eq: true}}){ id, metadata { nodeEndpoint } } }
So it would take only the nodes that are in the distributing set, because the other ones are not interesting for us. This query returns everything we need: the node itself plus family:bucket-worker.
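For reference, a minimal sketch of how a checker could send that query and pull out the endpoints. The query text is the one quoted above; the QN URL is whatever you pass in, and the helper names (`fetch_operators`, `extract_endpoints`) and the 10 second timeout are my own assumptions, not part of any existing tool.

```python
import json
import urllib.request

# Query text taken from the comment above.
QUERY = """
query {
  distributionBucketOperators(where: {distributionBucket: {distributing_eq: true}}) {
    id
    metadata { nodeEndpoint }
  }
}
"""


def fetch_operators(qn_url):
    """POST the query to a query node and return the decoded JSON response.

    `qn_url` is supplied by the caller; no particular QN address is assumed.
    """
    body = json.dumps({"query": QUERY}).encode("utf-8")
    req = urllib.request.Request(
        qn_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def extract_endpoints(response):
    """Map operator id (family:bucket-worker) to its nodeEndpoint.

    Operators without metadata or without an endpoint are skipped.
    """
    operators = (response.get("data") or {}).get("distributionBucketOperators") or []
    endpoints = {}
    for op in operators:
        endpoint = (op.get("metadata") or {}).get("nodeEndpoint")
        if endpoint:
            endpoints[op["id"]] = endpoint
    return endpoints
```

Usage would be something like `extract_endpoints(fetch_operators(qn_url))`, which gives one entry per operator that actually published an endpoint.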
@DzhideX @bwhm Still waiting for a SP estimate for this
Hey, I agree tools like these are needed, but discord doesn't seem like the right platform?
I would imagine both Workers and Leads deploy a system that alerts them on email, or some app that doesn't post notifications unless an "urgent" action is required. A discord notification would likely drown most of the time...?
I think we can close this ticket - we've got a tool for doing such checks and excluding distributors from active set automatically. https://discord.com/channels/811216481340751934/933726271832227911/1011700093356888224
Closing as per comment above
To document the result, move to Done
I have read the comments on GitHub about the bot we discussed, and it looks like @bwhm does not find it a very effective tool. Maybe messaging one particular messenger is not the best approach, and I agree with that, but it was the first thought that came to my mind after @klaudiusz.eth raised this issue again, and I had the SP bot in mind as an example. Having thought about it for a while, I came up with my own tool that does the check automatically. It sends a request to the QN, gets all distributor nodes that are active at the moment, and fires test requests at the status endpoint of each node. If a distributor node fails this check N times (the script is run by cron), it is excluded from the distributing set automatically. The script is already running and checking nodes in my group: https://github.com/alexznet/JosytreamDistributorWatcher/blob/main/distributor-watcher.sh
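To illustrate the "fails N times, then exclude" rule, here is a rough sketch of the counting step only. It is not the linked shell script: the function name `update_failures` is hypothetical, and I am assuming N means N failed runs in a row, with a pass resetting the count.

```python
def update_failures(failures, results, threshold):
    """Update per-node failure counts after one checker run.

    failures:  dict of node id -> consecutive failed runs (mutated in place)
    results:   dict of node id -> True if the status check passed this run
    threshold: N, the number of consecutive failures that triggers exclusion

    Returns the sorted ids of nodes that have reached the threshold.
    Assumption: a single passing check resets the node's counter.
    """
    to_exclude = []
    for node_id, ok in results.items():
        if ok:
            failures.pop(node_id, None)  # node recovered, forget old failures
            continue
        failures[node_id] = failures.get(node_id, 0) + 1
        if failures[node_id] >= threshold:
            to_exclude.append(node_id)
    return sorted(to_exclude)
```

Since cron starts a fresh process on every run, the `failures` dict would have to be saved somewhere between runs (a small file, for example). How the exclusion itself is submitted is left out here.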
Context: Requested by Storage WG Lead AlexZNet | C.Sailors#0968 as a must-have for mainnet.
Scope: Discord bot that pings each distributor provider's /distributor/api/v1/status endpoint. Client timeout = 250ms (configurable). Failure to receive an HTTP 200 within the specified interval yields a notification in the #distributors channel.
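A minimal sketch of the status check described in the scope, assuming the caller already has the full status URL for a node. The function name `is_healthy` and the injectable `opener` parameter are my own choices for illustration; only the 250ms default and the "HTTP 200 or it counts as failing" rule come from the scope.

```python
import http.client
import urllib.request


def is_healthy(status_url, timeout=0.25, opener=urllib.request.urlopen):
    """Return True only if the status URL answers with HTTP 200 in time.

    timeout is in seconds (0.25 = the 250ms default from the scope).
    Note: urlopen applies the timeout to each blocking socket operation,
    so this approximates, rather than strictly enforces, a total deadline.
    `opener` exists only so the check can be exercised without a network.
    """
    try:
        with opener(status_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, DNS errors and the
        # HTTPError that urlopen raises for 4xx/5xx responses.
        return False
```

Anything other than a clean 200 is treated the same way here, so a node that answers slowly and a node that is down would both end up in the failing list.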
To avoid spamming the channel, the following flow is suggested when the failing node is detected.
- Record the failing node endpoint in a DB.
- Schedule notifications every 15 minutes (configurable) that scan the DB records and send 1 "summary" notification for all failing nodes.
- If a node gets back up, remove its DB record.

Example notification:
@Distribution Worker Failing node(s) alert:
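The anti-spam flow above could be sketched roughly like this. SQLite, the table layout, the function names and the one-endpoint-per-line message body are all assumptions made for illustration; only the header line and the record / summarize / remove steps come from the issue text. Posting to Discord and the 15 minute scheduling are left out.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) the store of currently failing nodes."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS failing_nodes ("
        "endpoint TEXT PRIMARY KEY, first_seen REAL NOT NULL)"
    )
    return db


def record_result(db, endpoint, healthy, now):
    """Add a failing endpoint, or drop its record once it is back up."""
    if healthy:
        db.execute("DELETE FROM failing_nodes WHERE endpoint = ?", (endpoint,))
    else:
        # INSERT OR IGNORE keeps the original first_seen for repeat failures.
        db.execute(
            "INSERT OR IGNORE INTO failing_nodes (endpoint, first_seen) VALUES (?, ?)",
            (endpoint, now),
        )
    db.commit()


def build_summary(db, mention="@Distribution Worker"):
    """Build one summary message for all failing nodes, or None if there are none."""
    rows = db.execute(
        "SELECT endpoint FROM failing_nodes ORDER BY endpoint"
    ).fetchall()
    if not rows:
        return None
    lines = [f"{mention} Failing node(s) alert:"]
    lines += [f"- {endpoint}" for (endpoint,) in rows]
    return "\n".join(lines)
```

The scheduled job would call `build_summary` once per interval and post only when it returns something, so a node that keeps failing produces one line in a periodic summary instead of a message per check.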
Estimate: 6-8h