Joystream / atlas

Whitelabel consumer and publisher experience for Joystream
https://www.joystream.org
GNU General Public License v3.0

Playback issue #3954

Open ivanturlakov opened 1 year ago

ivanturlakov commented 1 year ago

Context https://pioneerapp.xyz/#/forum/thread/301

✅ Case 1:

URL: https://l1.media/video/364
Browser: Chrome 111.0.5563.64 (Official), (x86_64)
Region: Ukraine, Kyiv
Network speed: 50 Mbps
Load time until playback starts: 800 ms (2nd try: 1300 ms, 3rd try: 2300 ms)
Number of playback interruptions (buffering): no interruptions
Bitrate (size/length): 1439 MB / 44 min 25 sec
Chosen DP: https://sieemmanodes.com/distributor/api/v1/assets/879
Console errors, comments: no errors

⚠️ Case 2:

Number of playback interruptions (buffering): buffering sometimes takes more than 3-5 seconds, and sometimes much more, as in the recording below:

https://user-images.githubusercontent.com/46903215/226366782-e67b3c3b-2af7-479f-8a1c-8f9a108424cc.mov

Chosen DP: https://sieemmanodes.com/distributor/api/v1/assets/879
Console errors, comments: no errors

dmtrjsg commented 1 year ago

@ivanturlakov thanks for raising this; we will add it to the sprint that starts on Monday.

traumschule commented 1 year ago

Other reported issues:

dmtrjsg commented 1 year ago

Meeting minutes from the playback issues call held by JSG.

Present: @bedeho @Lezek123 @dmtrjsg @attemka

Playback issues - Solutions

1. Monitoring and Reporting

Understanding and detecting problems falls on the DAO, but JSG needs to give the DAO team the best possible opportunity to understand how well applications interact with the infrastructure.

This applies to both downloading and uploading.

We need to introduce a standard for how applications report on how well things are going.

GH issue: https://github.com/Joystream/atlas/issues/3980
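As a rough illustration of what such a standard could look like, here is a minimal TypeScript sketch of a playback report. The endpoint and all field names are assumptions; the actual format is to be defined in the issue linked above.

```ts
// Minimal sketch of a standardized playback report.
// All names and the collector endpoint are hypothetical.
interface PlaybackReport {
  videoId: string;
  distributorUrl: string;     // chosen DP, e.g. .../distributor/api/v1/assets/879
  timeToFirstFrameMs: number; // load time until playback starts
  bufferingEvents: number;    // number of playback interruptions
  totalBufferingMs: number;   // total time spent buffering
  errors: string[];           // console/player errors, if any
}

async function reportPlayback(report: PlaybackReport): Promise<void> {
  // The endpoint is an assumption; the real reporting standard is TBD.
  await fetch('https://example.org/playback-reports', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(report),
  });
}
```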

Priority: High. Until this one is done, it's not worth investing more time in the other issues.

Owner: Artem


Current status on logging, i.e. what we collect:

  1. An ES server run by the community. It collects too much data and needs to be reviewed, as it is too verbose and hard to make sense of.
  2. An ES and Kibana instance for Atlas that collects response times from distributor nodes. Errors from the client's perspective are all collected there.

To do:

  1. Do not generate more data before we analyse what we have.
  2. Connect the data that we already collect.
  3. Review and add a set of tools to analyse it properly (see the sketch below).
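To illustrate step 3, a sketch of the kind of tooling that could sit on top of the ES data we already collect, using the official @elastic/elasticsearch client. The node URL, index pattern, and field names are assumptions, not the real schema.

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical ES endpoint; substitute the community-run instance.
const es = new Client({ node: 'https://example.org/es' });

// Aggregate the distributor response times Atlas already ships to ES,
// surfacing the slowest nodes by 95th-percentile response time.
async function slowestDistributors() {
  const result = await es.search({
    index: 'atlas-distributor-responses-*', // hypothetical index pattern
    size: 0,
    aggs: {
      by_node: {
        terms: { field: 'distributorUrl.keyword' }, // hypothetical field
        aggs: {
          p95_ms: { percentiles: { field: 'responseTimeMs', percents: [95] } },
        },
      },
    },
  });
  return result.aggregations;
}
```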

How easy is it to review and collect the data points/errors that we have?

Application level: there are different types of logs by severity: debug logs for specific components, error logs, and warnings. Both the storage and distributor nodes use this.

Everything that the storage and distributor leads need to know is already there.

We need to review the severity levels (e.g. downgrading some entries from warning to debug).
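For illustration, a minimal sketch of what such a severity review could look like with a winston-style logger; the configuration and call sites are illustrative, not the nodes' actual setup.

```ts
import winston from 'winston';

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL ?? 'info',
  transports: [new winston.transports.Console()],
});

// Before: routine slow responses logged as warnings drown out real issues.
// logger.warn('response took 900ms', { assetId: 879 });

// After: latency details go to debug; warn is reserved for conditions
// a distributor lead should actually act on.
logger.debug('response took 900ms', { assetId: 879 });
logger.warn('asset fetch failed repeatedly', { assetId: 879, attempts: 3 });
```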

Dedicated dashboard for Atlas:

https://atlas-services.joystream.org/kibana/app/dashboards?auth_provider_hint=anonymous1#/view/cecb4ca0-792a-11ec-8b96-2be5a346ee21?_g=(filters:!())&_a=(description:'',filters:!(),fullScreenMode:!f,options:(hidePanelTitles:!f,useMargins:!t),query:(language:kuery,query:''),tags:!(),timeRestore:!t,title:'SP%20dashboard',viewMode:view)

Leszek: We can also have instance-specific logs, to track the same for specific clients.

Artem: Availability and quality of service of distributors need to be separated and dealt with separately.

It’s the time it takes to download that requires addressing.

---

How do we catch performance deterioration (e.g. when it takes too long to load after skipping)?

---

There is also a Kibana dashboard for this; example: https://kibana.joystreamstats.live/login?next=%2F

---

JSG actions:

We are not changing anything on the Atlas side. We will collect an overview of who owns what and what access they have.

JSG should respond to requests for additional data rather than trying to lead and enforce an approach.

---

Artem: on the Atlas side we only looked at the distance to distributors, not congestion (e.g. if 4 users were already watching content, we didn't factor that in).

Now that distributor selection happens on the Orion side, it would be nice for distributors to feed more information back to Orion itself.
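A hypothetical sketch of what congestion-aware selection could look like; today only the distance/latency signal is used, and all names below are assumptions.

```ts
// Feedback a distributor could report back to Orion.
interface DistributorFeedback {
  url: string;
  latencyMs: number;     // the "distance" signal used today
  activeStreams: number; // congestion signal distributors would report
}

// Pick the distributor with the best combined score.
// Assumes a non-empty candidate list; weights are placeholders to be tuned.
function pickDistributor(candidates: DistributorFeedback[]): DistributorFeedback {
  const score = (d: DistributorFeedback) => d.latencyMs + 50 * d.activeStreams;
  return candidates.reduce((best, d) => (score(d) < score(best) ? d : best));
}
```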

Action: an issue to be raised for later; this is not the biggest problem right now.


Short-term issue to be raised: add monitoring to Argus log entries so that the specific failing query is attached to the error. This lets us collect stats more precisely.
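A minimal sketch of what attaching the query to the error entry could look like; the logger call shape is illustrative, not Argus's actual API.

```ts
import winston from 'winston';

const argusLogger = winston.createLogger({
  transports: [new winston.transports.Console()],
});

// Attach the full request to the error entry so Kibana can group
// errors by the specific query that failed.
function logAssetError(err: Error, req: { method: string; url: string }) {
  argusLogger.error('asset request failed', {
    message: err.message,
    method: req.method,
    url: req.url, // e.g. /distributor/api/v1/assets/879?download=true
  });
}
```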


Bedeho to speak with the Distributors lead and take this further.


JSG: test performance after distributor management is improved. There are tools to make requests for the same content to multiple distributors and see how each performs (see the sketch below).
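A sketch of such a check, assuming Node 18+ (global fetch); the distributor list is hypothetical and the asset path is the one from the report above. Fetching only the first MiB via a Range header keeps the probe cheap.

```ts
const distributors = [
  'https://sieemmanodes.com',
  // ...other distributor base URLs to compare
];
const assetPath = '/distributor/api/v1/assets/879';

async function probe(): Promise<void> {
  for (const base of distributors) {
    const start = performance.now();
    const res = await fetch(base + assetPath, {
      headers: { Range: 'bytes=0-1048575' }, // first 1 MiB only
    });
    await res.arrayBuffer(); // wait for the body to finish downloading
    const ms = Math.round(performance.now() - start);
    console.log(`${base}: HTTP ${res.status}, ${ms} ms for the first MiB`);
  }
}

probe().catch(console.error);
```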

2. Multiple resolutions

Until the problem is diagnosed further, it is not worth investing in introducing more than one resolution.

3. Adaptive streaming

Parked for later

Summary

Biggest problem: some nodes take too long to deliver assets. The lead is able to detect this and to start an operational and technical analysis, but that is not happening.

Objective: manage the distributors with the aim of increasing performance. This should be done before we add more data or make any changes on the apps side.