Joystream / youtube-synch

YouTube Synchronization

Fault Tolerance Plan for YT-Synch infrastructure #149

Closed zeeshanakram3 closed 1 year ago

zeeshanakram3 commented 1 year ago

Addresses #129

Fault Tolerance Testing Plan for the YT-synch Service:

Introduction

This document outlines a testing plan for ensuring the fault tolerance of the YT-synch service. The service has two main components: a web app (used for onboarding creators to the YPP) and a backend application, which includes persistent storage for tracking the state of the YouTube channels and the infrastructure to sync videos from YouTube to the Joystream network. The web app enables creators to authorize their Google accounts using the OAuth 2.0 workflow. This workflow returns an authorization_code, which the frontend app passes to the backend; the backend then exchanges the authorization code for an access token, and all privileged actions (e.g., fetching video info for a given channel) require that access_token. The syncing service downloads the videos from YouTube and uploads them to the Joystream network.
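
The code-for-token step described above can be sketched roughly as follows. This is only an illustration of the standard OAuth 2.0 authorization-code exchange against Google's token endpoint; the function names, the parameter object, and all values are assumptions, and the actual backend may well use a Google client library instead of a raw request.

```typescript
// Sketch only: exchanging an OAuth 2.0 authorization code for an access token.
// Names and values below are hypothetical, not taken from the YT-synch codebase.
interface TokenExchangeParams {
  code: string;         // authorization_code passed on by the frontend
  clientId: string;     // placeholder: the backend's Google client ID
  clientSecret: string; // placeholder: the backend's Google client secret
  redirectUri: string;  // must match the URI used in the authorization step
}

const GOOGLE_TOKEN_ENDPOINT = "https://oauth2.googleapis.com/token";

// Builds the form-encoded body of the token request.
function buildTokenRequestBody(p: TokenExchangeParams): URLSearchParams {
  return new URLSearchParams({
    grant_type: "authorization_code",
    code: p.code,
    client_id: p.clientId,
    client_secret: p.clientSecret,
    redirect_uri: p.redirectUri,
  });
}

// POSTs the body and returns the access token; throws on a non-2xx response,
// which is the failure path a fault tolerance test would exercise.
async function exchangeCode(p: TokenExchangeParams): Promise<string> {
  const res = await fetch(GOOGLE_TOKEN_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: buildTokenRequestBody(p).toString(),
  });
  if (!res.ok) throw new Error(`Token exchange failed with HTTP ${res.status}`);
  const json = (await res.json()) as { access_token: string };
  return json.access_token;
}
```

Keeping the body construction separate from the network call makes it possible to test the request shape without contacting Google.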

Objectives

This testing plan aims to validate that the service can handle unexpected errors, failures, and exceptions and ensure that the service can continue to operate normally in the event of a failure. After analyzing the failures that could potentially occur and halt the infrastructure, we can categorize them into two types:

The plan will mainly cover fault tolerance in the case of external APIs:

External dependencies/APIs

RPC node

Although the RPC node is a critical component, temporary unavailability should not completely halt the service, so the intended behavior when the RPC API isn't working is:

Query Node

The intended behavior when Query node API isn't working is:

Storage Node

The intended behavior when Storage node API isn't working is:

Google API

The Google API is one of the most important components of the YT-synch infrastructure: it is essential to the user signup workflow and to polling channels' state in the sync workflow. Here are the types of errors and exceptions that can arise from the Google API:

Google API service is down

Handling other negative cases

Other failure cases could also affect the functioning of the infrastructure, so they are mentioned here:

Developing Test cases for External dependencies

Recovering from Failure

Recovering from failure in case of external dependency can include manual intervention and not depending upon the fact that if API endpoint becomes available after some time; otherwise, we need a manual change to the config file to set new (say RPC or QN) endpoints.
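
For the case where an endpoint does come back on its own, a test can check that calls are retried rather than failing permanently. A minimal sketch, assuming exponential backoff; the source does not say that YT-synch retries this way, and `backoffDelayMs`, `withRetry`, and the default values are hypothetical.

```typescript
// Sketch only: retry with exponential backoff for a temporarily unavailable
// external API (RPC node, Query Node, storage node, Google API).
// All names and defaults here are assumptions, not YT-synch code.

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Calls `fn` until it succeeds or `maxAttempts` calls have failed,
// then rethrows the last error. `sleep` is injectable so tests can skip waiting.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts: number = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Wait before the next attempt, but not after the final failure.
      if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

A test could pass in a stub `fn` that fails a fixed number of times and then succeeds, to simulate an endpoint that recovers. An endpoint that never recovers still needs the manual config change described above.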


Enhancements

Possible enhancements to the YT-synch infrastructure that I can think of to ensure the smooth working of the service:

Database replication

Currently, YT-synch uses a single instance of AWS DynamoDB tables to store the state of participant YouTube channels and their videos (e.g., which videos are new, which have been synced, etc.). DynamoDB is highly scalable, so no scalability problems are expected. However, the data replication problem still needs to be addressed, as it isn't provided out of the box: in case of a failure or an unintended action by a developer (e.g., accidentally deleting the table), we need to make sure that a replica instance is safe from any such failure. I did some research on this, and the effort shouldn't take long, so I will work on enabling data replication.
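
Which mechanism is intended is not stated here. As a rough illustration of two related safeguards, and explicitly not replication to a second instance (that would be a separate route, such as DynamoDB global tables), the sketch below builds the request inputs for enabling point-in-time recovery and deletion protection. With the AWS SDK for JavaScript v3 these objects would be passed to `UpdateContinuousBackupsCommand` and `UpdateTableCommand` from `@aws-sdk/client-dynamodb`. The helper names and the table name are placeholders.

```typescript
// Sketch only: request inputs for two DynamoDB safeguards against data loss.
// Helper names and table names are hypothetical; these are backup and
// deletion safeguards, not the replication setup mentioned in the text.

// Input for UpdateContinuousBackupsCommand: enables point-in-time recovery,
// which allows restoring the table to an earlier state.
function pointInTimeRecoveryInput(tableName: string) {
  return {
    TableName: tableName,
    PointInTimeRecoverySpecification: { PointInTimeRecoveryEnabled: true },
  };
}

// Input for UpdateTableCommand: enables deletion protection, which blocks
// accidental table deletion until the flag is turned off again.
function deletionProtectionInput(tableName: string) {
  return { TableName: tableName, DeletionProtectionEnabled: true };
}

// Example with a placeholder table name.
const exampleInputs = [
  pointInTimeRecoveryInput("channels"),
  deletionProtectionInput("channels"),
];
```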

Handling ratelimit from Youtube due to requests overload

The YouTube downloader library (ytdl) we use for downloading the videos states that YT could enforce a rate limit on requests from a given IP due to excessive asset download requests (note that this rate limit is different from the API quota limit). The documentation states that there are multiple ways to solve this problem, e.g., using a proxy or rotating IPv6 addresses. I think it is not required to tackle this problem right now, as for some time in the future, we won't reach the state where YT would enforce rate limiting.
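
Short of proxies or IPv6 rotation, one possible mitigation is to cap the download request rate on the client side. The sketch below is a generic sliding-window limiter, not something the ytdl documentation or this plan prescribes; the class name is hypothetical and the limits are placeholders, since the actual ceiling enforced by YT is not known.

```typescript
// Sketch only: client-side sliding-window limiter for download requests.
// The class and its limits are assumptions; YT's real threshold is unknown.
class SlidingWindowLimiter {
  private timestamps: number[] = [];
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if fewer than `maxRequests`
  // requests were made in the last `windowMs` milliseconds; otherwise false.
  // `nowMs` is injectable so the behavior can be tested deterministically.
  tryAcquire(nowMs: number = Date.now()): boolean {
    const cutoff = nowMs - this.windowMs;
    this.timestamps = this.timestamps.filter((t) => t > cutoff);
    if (this.timestamps.length >= this.maxRequests) return false;
    this.timestamps.push(nowMs);
    return true;
  }
}

// Example: at most 10 download requests per minute (placeholder numbers).
const downloadLimiter = new SlidingWindowLimiter(10, 60000);
```

A download worker would call `tryAcquire()` before each request and delay the download when it returns false.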

Adding collaborator balance info into YPP API's /status endpoint

If the collaborator account runs out of funds, video creation won't work. To tackle this, add collaborator balance info to the YPP API /status endpoint, so that the infrastructure operator is aware that the account needs to be topped up.
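
A rough sketch of what such a /status payload could carry is shown below. The existing response shape is not given here, so every field name, the `buildStatus` helper, and the low-balance threshold are assumptions for illustration only.

```typescript
// Sketch only: a possible /status payload that includes the collaborator balance.
// Field names, the helper, and the threshold are hypothetical.
interface StatusResponse {
  syncStatus: string;              // placeholder for whatever /status already reports
  collaboratorBalance: string;     // stringified, since bigint is not JSON-serializable
  collaboratorBalanceLow: boolean; // true when the account should be topped up
}

function buildStatus(syncStatus: string, balance: bigint, minBalance: bigint): StatusResponse {
  return {
    syncStatus,
    collaboratorBalance: balance.toString(),
    collaboratorBalanceLow: balance < minBalance,
  };
}
```

An explicit boolean flag lets the operator, or a monitoring script polling the endpoint, react without knowing the threshold.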


dmtrjsg commented 1 year ago

Excellent work @zeeshanakram3, nicely thought through!

A few questions:

  1. RPC, QN, Storage Nodes outage

When we say the service does not work, what type of error would the user be exposed to? Would it be possible to serve an interpretable error code for the FE to display to users?

  2. Collaborator account runs out of funds. Ensure that the infrastructure operator knows he needs to top-up the account.

How does the operator find out about this without interrogating the API? What is the impact on users? Are we able to have some sort of status check in the Operator Web Tool we are planning to build? Is there an MVP way to get some sort of error notification for us, the supporting team?

  3. Recovering from failure in case of external dependency can include manual intervention and not depending upon the fact that if API endpoint becomes available after some time,

What do you mean here? What has to be done here specifically?

  4. I think it is not required to tackle this problem right now, as for some time in the future, we won't reach the state where YT would enforce rate limiting.

When do you think we would need to figure something out? What's the limit in numbers on videos downloaded concurrently / requests to YT?

dmtrjsg commented 1 year ago

1️⃣

When we say the service does not work, what type of error would user be exposed to?

Creators will not be exposed to these errors, since they interact with Atlas, and Atlas does not know about anything apart from the onboarding component.

If the error is in the onboarding component, then the error will be returned (Google API is down, QN is down, or the quota is exhausted). Error codes exist and are returned by the BE for:

2️⃣

How does the operator find out when collaborator funds run out?

Notifications are not built for this, but it can be done with an API request as an MVP. We will add the collaborator balance display to the YPP Operator Tool.

3️⃣ If the YT API is down for any length of time, even a long one, no manual intervention by a developer should be needed; it will start working again. For internal components, we need to test whether intervention is needed for recovery.

4️⃣ The rate limit ceiling is not publicly disclosed and cannot be tested. This is not a blocker for release, but we will work on it after release. @zeeshanakram3 will prepare an issue with the ypp-v1 label and we will add it to the YPP Asana project.

dmtrjsg commented 1 year ago

DM to test:

dmtrjsg commented 1 year ago

Envs:

Infra tests - local env

Collaborator - new env that Leszek created today, the one that is used for Apps YPP testing.

bedeho commented 1 year ago

Will there be a report of findings here?

zeeshanakram3 commented 1 year ago

Will there be a report of findings here?

@bedeho Sure, I will share the QA findings here.

dmtrjsg commented 1 year ago

@bedeho we have a separate issue for tracking execution, which is why this one was closed (from Asana, during our last call) without outcomes posted.