
[colossus] Identify problems in the Storage Node #4952

chrlschwb opened this issue 10 months ago

chrlschwb commented 10 months ago

1- Add the storage node to the probe and assign it to the test channels only.
2- Enable tracing on the node.
3- Define and document the various storage events, especially the process for GET and POST events.
4- Identify the bottleneck for GET and POST events.
5- Propose a fix.
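For item 2, one low-effort way to start tracing GET and POST storage events is per-request timing middleware. A minimal sketch, assuming the node serves its HTTP API through Express; the route, port and log format are illustrative, not actual Colossus code:

```typescript
import express, { Request, Response, NextFunction } from 'express'

const app = express()

// Log method, path, status and duration of every request so slow GET/POST
// storage events (e.g. close to the 5s probe timeout) can be spotted.
app.use((req: Request, res: Response, next: NextFunction) => {
  const startedAt = process.hrtime.bigint()
  res.on('finish', () => {
    const elapsedMs = Number(process.hrtime.bigint() - startedAt) / 1e6
    console.log(`${req.method} ${req.originalUrl} -> ${res.statusCode} in ${elapsedMs.toFixed(1)} ms`)
  })
  next()
})

// Illustrative placeholder route; the real node serves actual asset data.
app.get('/storage/api/v1/files/:id', (req, res) => {
  res.send(`file ${req.params.id}`)
})

app.listen(3333)
```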

yasiryagi commented 9 months ago

Issue: when retrieving a file, the storage node's response time intermittently exceeds 5 seconds. Data: https://grafana.joystream.yyagi.cloud/d/VbiCFzWMz/blackbox?orgId=1&refresh=5m&from=now-24h&to=now

ignazio-bovo commented 8 months ago

Storage issue

Intro

As the subject of this ticket indicates, the issue being reported is that the current blackbox exporter configuration probes a specific image file every 5 minutes in order to construct the data displayed on the SWG Grafana board. Occasionally, and not deterministically, the response time exceeds (or comes close to) 5 s, which is classified as a timeout by the prober. This issue appears consistently on almost all active Colossus nodes. The goals of this report are:

Analysis

Load testing

The analysis uses Colossus version 3.8.1, and the tests have been executed using Postman with the following HTTP request:

GET https://23.88.65.164.nip.io/storage/api/v1/files/1343
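For reference, the probe behaviour can be reproduced outside Postman with a few lines of Node.js (18+, for the global fetch). This is a sketch, not part of the original test setup; the 5 s threshold mirrors the blackbox exporter timeout mentioned above:

```typescript
// Time a single request against the same asset and treat > 5 s as a timeout,
// roughly what the blackbox prober does every 5 minutes.
const url = 'https://23.88.65.164.nip.io/storage/api/v1/files/1343'
const timeoutMs = 5_000

async function probeOnce(): Promise<void> {
  const startedAt = Date.now()
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) })
    await res.arrayBuffer() // download the full body, as the prober does
    console.log(`status=${res.status} elapsed=${Date.now() - startedAt} ms`)
  } catch (err) {
    console.log(`timed out or failed after ${Date.now() - startedAt} ms`, err)
  }
}

probeOnce()
```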

Server specs: the server runs a 16-core CPU with 32 GB of RAM and 1 TB of bandwidth, on Ubuntu 22.04 (Jammy Jellyfish).

I used a 5-minute-long load test with 20 virtual users where:

Loom 📹 https://www.loom.com/share/ac1fdcca629343fc9472be48acf7c050?sid=6a9e8078-6fc5-4a42-87bb-da3a99638fe7
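For anyone wanting to repeat this without Postman, a rough equivalent of the test (20 virtual users for 5 minutes, counting responses slower than 5 s) can be sketched in plain Node.js. The numbers mirror the report; the script itself is an assumption, not the setup actually used:

```typescript
const url = 'https://23.88.65.164.nip.io/storage/api/v1/files/1343'
const virtualUsers = 20
const durationMs = 5 * 60 * 1000
const slowThresholdMs = 5_000

let total = 0
let slow = 0
let failed = 0

// Each virtual user requests the asset in a loop until the test window closes.
async function virtualUser(stopAt: number): Promise<void> {
  while (Date.now() < stopAt) {
    const startedAt = Date.now()
    try {
      const res = await fetch(url)
      await res.arrayBuffer()
      total++
      if (Date.now() - startedAt >= slowThresholdMs) slow++
    } catch {
      failed++
    }
  }
}

async function main(): Promise<void> {
  const stopAt = Date.now() + durationMs
  await Promise.all(Array.from({ length: virtualUsers }, () => virtualUser(stopAt)))
  console.log({ total, slow, failed })
}

main()
```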

Key observations

Colossus stack trace explanation

Main execution

Colossus is executed as a single-threaded Node.js worker process, which means the Node.js runtime uses only one core out of the 16 available at any time (which core is used is decided by the OS scheduler, which is why in the Loom video you can see the load moving across cores). In particular, the two main competing functions started at the server entrypoint are the REST API server (app.listen) and the synching routine.
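To make the contention concrete, here is a deliberately simplified sketch (not Colossus code) of what happens when a CPU-bound synching step and an HTTP server share one Node.js thread: while the synchronous hashing below runs, queued requests cannot be answered.

```typescript
import { createServer } from 'node:http'
import { createHash } from 'node:crypto'

// Stand-in for the synching routine: periodic CPU-bound work on the main thread.
setInterval(() => {
  const data = Buffer.alloc(64 * 1024 * 1024) // 64 MiB buffer, purely illustrative
  for (let i = 0; i < 10; i++) {
    createHash('sha256').update(data).digest() // blocks the event loop while it runs
  }
}, 1000)

// Stand-in for the REST API: responses stall whenever the hashing loop is busy.
createServer((_req, res) => res.end('ok')).listen(3333)
```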

Recommendation

I think at this point that the app.listen part (i.e. the "REST API process") could be executed on its own thread; this would eliminate the competition for the same core with the synching routine. I have checked with @zeeshanakram3 and he assured me that CPU resource usage is a known problem in Colossus and a fix is in the pipeline.
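A hedged sketch of that direction using node:worker_threads follows. It moves the synching work to a worker (the report suggests the inverse split, running the app.listen part on its own thread, but either way the two stop competing for one core). The file layout, port and shape of the sync loop are assumptions, not Colossus code:

```typescript
import { Worker, isMainThread, parentPort } from 'node:worker_threads'
import { createHash } from 'node:crypto'
import { createServer } from 'node:http'

if (isMainThread) {
  // REST API stays on the main thread and keeps its event loop responsive.
  new Worker(__filename) // in a TS project this would point at the compiled JS file
  createServer((_req, res) => res.end('ok')).listen(3333)
} else {
  // Sync routine now runs on its own thread/core and no longer blocks the API.
  setInterval(() => {
    const data = Buffer.alloc(64 * 1024 * 1024)
    createHash('sha256').update(data).digest()
    parentPort?.postMessage('sync tick done')
  }, 1000)
}
```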

mnaamani commented 8 months ago

Nice work taking these measurements. Which version of colossus did you test against? Could you repeat the same test for the latest version, v3.10.0?

ignazio-bovo commented 8 months ago

I will try @mnaamani, with the blessing of @yasiryagi

mnaamani commented 8 months ago

> I will try @mnaamani, with the blessing of @yasiryagi

If you haven't performed the benchmark yet, there is a v3.10.1 release docker image already published: joystream/storage-node:3.10.1

ignazio-bovo commented 8 months ago

I have tried it with the latest Colossus and the problem seems to be gone; however, there's still some issue when the asset file hash is mismatched: https://www.loom.com/share/0e2ae18ec8fb410da80b3d37e065731b?sid=fcb2089e-3ab1-4eb1-acdf-48437f85e3af

traumschule commented 8 months ago

Nice work. Just to clarify: how many gigahertz per core were used in the test? Were you able to reproduce the 5s delays? What is the resulting minimal server requirement for SPs in terms of CPU? How do you suggest improving the caching function for a future huge number of objects?

kdembler commented 8 months ago

> I have tried it with the latest Colossus and the problem seems to be gone; however, there's still some issue when the asset file hash is mismatched

Can you share a bit more about the problem you're describing? It's not immediately apparent to me from the loom video