bedeho commented 2 years ago

Background

The lead of the storage group needs an automated way to learn which storage providers are having trouble. Right now, there are tools that allow the lead to perform certain live benchmarks, I believe primarily on downloading content. However, this is an active step that has to be initiated, and therefore probably happens relatively rarely. The most acute symptom today is that uploads at times fail frequently, which is a very bad experience for creators, who are by far the most important next audience segment for Joystream. Right now, this is being solved by having application operators (e.g. @kdembler) manually reach out to the lead, which does not scale. Preemptive detection of such failures through active interrogation would largely solve this problem.

To be clear, the purpose of this tooling is not to detect active adversarial providers, but rather faults that providers themselves are unaware of due to misconfiguration, resource exhaustion or other unintentional factors.

Proposal

An online service which continuously attempts to interrogate storage providers and reports the results of such interrogation to some third-party data warehouse, through some API, and, importantly, also notifies the lead operating the infrastructure about failures. It's not clear whether the automation of these notifications should be outsourced to the warehouse or built directly into the service, but it has to be part of the overall package in some way.

The scope of this tool can grow over time, but the most important part of the initial service scope is to check whether uploading assets works or not. Failures could be any of the following:

inability to resolve host
inability to connect to host
inability to initiate upload
upload progresses too slowly
upload is prematurely terminated by the host
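
As a rough illustration, the failure modes listed above could be told apart with a probe along these lines. This is only a sketch under stated assumptions: `probe_upload`, `UploadFailure`, `UploadNotAccepted`, the `upload` callable and the throughput threshold are all hypothetical names, and the actual upload step against a storage node is not shown, since it depends on the storage node API.

```python
import socket
import time
from enum import Enum
from typing import Callable, Optional


class UploadFailure(Enum):
    RESOLVE = "inability to resolve host"
    CONNECT = "inability to connect to host"
    INITIATE = "inability to initiate upload"
    TOO_SLOW = "upload progresses too slowly"
    TERMINATED = "upload is prematurely terminated by the host"


class UploadNotAccepted(Exception):
    """Hypothetical: raised by `upload` when the host refuses to start the upload."""


def probe_upload(
    host: str,
    port: int,
    payload: bytes,
    upload: Callable,            # hypothetical: upload(conn, payload) performs the transfer
    min_bytes_per_sec: float,    # assumed threshold for "too slow"
    resolve: Callable = socket.getaddrinfo,
    connect: Callable = socket.create_connection,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[UploadFailure]:
    """Return the first failure observed, or None if the upload succeeded."""
    try:
        resolve(host, port)
    except socket.gaierror:
        return UploadFailure.RESOLVE
    try:
        conn = connect((host, port), timeout=10)
    except OSError:
        return UploadFailure.CONNECT
    start = clock()
    try:
        upload(conn, payload)
    except UploadNotAccepted:
        return UploadFailure.INITIATE
    except (ConnectionError, EOFError):
        # Connection dropped part-way through the transfer.
        return UploadFailure.TERMINATED
    finally:
        conn.close()
    elapsed = clock() - start
    if elapsed > 0 and len(payload) / elapsed < min_bytes_per_sec:
        return UploadFailure.TOO_SLOW
    return None
```

The resolver, connector and clock are passed in as parameters so the classification logic can be exercised without touching the network; a real service would also need per-step timeouts, which are left out here.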

The simplest approach is to have the service use some predefined membership+channel for such interrogations, and upload assets just to that same channel. Cleanup can be done regularly by the same service. The channel should be set as unlisted in its metadata, so that it does not appear in apps.

Question

I also wonder whether we have proper tooling to allow storage providers to detect a subset of such failures on their end, and if such detection actually results in pushing data out to the operator through some channel.


0x2bc commented 2 years ago

For me, that sounds great. Just a few minor comments: (1) It would be really good to know which node received the failed upload event for a specific object. We don't have this info now. This may relate to the Elastic Search logging, @yasiryagi. (2) You mention "upload progresses too slowly", but that may depend on the geographic position of the node. Currently, when a user wants to upload an object, the storage node is chosen randomly; its geographic position is not taken into account. So this can be tricky to test.

bedeho commented 2 years ago

Perhaps we can just quickly address:

1) What events are currently being sent to the Elastic Search logger operated by the lead? 2) Is there some built-in way to send notifications from Elastic Search over, say, email or something else which grabs attention?

yasiryagi commented 2 years ago

Health check

Develop a system to track and act on the health of the nodes.

Group nodes into pools of a certain size, which could be the value of the replication variable.
Each pool randomly selects 1 or 2 leads to issue the health checks. Each node keeps track of the health checks/keep-alives it gets from the leads, and if it does not receive one for a certain amount of time, a new lead is selected. The health check tries to upload a small file to the node and retrieve it, at a certain interval.
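
A minimal sketch of the pooling and lead selection described above, assuming nodes are identified by plain strings. The function names (`make_pools`, `pick_leads`, `needs_new_lead`) and the keep-alive timeout parameter are hypothetical, and how nodes would agree on the selection between themselves is not addressed here.

```python
import random
from typing import List, Optional


def make_pools(node_ids: List[str], pool_size: int) -> List[List[str]]:
    """Split the node list into consecutive pools of at most pool_size nodes."""
    return [node_ids[i:i + pool_size] for i in range(0, len(node_ids), pool_size)]


def pick_leads(pool: List[str], count: int = 2,
               rng: Optional[random.Random] = None) -> List[str]:
    """Randomly pick up to `count` distinct leads from a pool."""
    rng = rng or random.Random()
    return rng.sample(pool, min(count, len(pool)))


def needs_new_lead(last_keepalive: float, now: float, timeout: float) -> bool:
    """True when no keep-alive has arrived from the lead within `timeout` seconds."""
    return now - last_keepalive > timeout
```

For example, `make_pools(["a", "b", "c", "d", "e"], 2)` gives two full pools and one pool holding the remaining node.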
Health check results:
- Success
- Failed
- inability to resolve host
- inability to connect to host
- Upload failed
- Retrieve failed

Storage node states:

Active: Successful health check
Maintenance: Node operator wants his node operational.
Failed: 3 failed health checks within a time window.
Recovering: Successful health check after some time in the failed state.
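
The four states above can be read as a small state machine. The sketch below is one possible interpretation, not a specification. The comment does not say how a node leaves Recovering, so the sketch assumes the next successful check moves it back to Active. It also assumes that health check results are ignored while in Maintenance, and that any success after Failed counts as Recovering regardless of how long the node was failed. The class and parameter names are hypothetical.

```python
from collections import deque
from enum import Enum


class NodeState(Enum):
    ACTIVE = "Active"
    MAINTENANCE = "Maintenance"
    FAILED = "Failed"
    RECOVERING = "Recovering"


class NodeHealth:
    def __init__(self, window_seconds: float = 300.0, max_failures: int = 3):
        self.state = NodeState.ACTIVE
        self.window_seconds = window_seconds
        self.max_failures = max_failures
        self._failures = deque()  # timestamps of recent failed checks

    def set_maintenance(self, enabled: bool) -> None:
        # Assumption: the operator toggles maintenance explicitly.
        self.state = NodeState.MAINTENANCE if enabled else NodeState.ACTIVE

    def record(self, ok: bool, now: float) -> NodeState:
        """Feed one health check result taken at time `now` (seconds)."""
        if self.state is NodeState.MAINTENANCE:
            return self.state  # assumption: checks are ignored in maintenance
        if ok:
            # Failed -> Recovering on first success, otherwise Active.
            if self.state is NodeState.FAILED:
                self.state = NodeState.RECOVERING
            else:
                self.state = NodeState.ACTIVE
            return self.state
        self._failures.append(now)
        # Drop failures that have fallen out of the time window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        if len(self._failures) >= self.max_failures:
            self.state = NodeState.FAILED
        return self.state
```

Usage would be one `NodeHealth` instance per storage node, with `record` called after every health check; failures are counted within the window rather than consecutively, which matches "3 failed health checks within a time window" most literally.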

traumschule commented 2 years ago

Upload errors are currently fetched from Sentry every 5 minutes: https://joystreamstats.live/static/upload-errors.json and visualized on https://joystreamstats.live/distribution. This can only be a temporary solution and won't catch uploads from non-Gleev apps.

I can set up ES email alerts. For this, someone needs to investigate what is collected (packetbeat, metricbeat) and define filters. It will also help to reduce the amount stored (rollups): so far more than 1.7 TB were collected and the partition has 13 GB. This should happen soon or new data can't be stored.

For automated upload tests, benchmarking and status checks, there are several issues already:

Joystream / joystream

`Lighthouse` - An automated storage interrogation service #4270
