Open chrlschwb opened 10 months ago
Issue: when retrieving a file, storage nodes intermittently exceed the 5 s response-time limit. Data: https://grafana.joystream.yyagi.cloud/d/VbiCFzWMz/blackbox?orgId=1&refresh=5m&from=now-24h&to=now
As stated in the subject of this ticket, the current blackbox exporter configuration probes for a specific image file every 5 minutes in order to construct the data displayed on the SWG Grafana board. Occasionally, and non-deterministically, the response time exceeds (or approaches) 5 s, which the prober classifies as a timeout. This issue appears consistently on almost all active Colossus nodes. The goals of this report are:
The analysis uses Colossus version 3.8.1, and the tests have been executed with Postman using the following HTTP request:
GET https://23.88.65.164.nip.io/storage/api/v1/files/1343
Server specs: the server has a 16-core CPU, 32 GB of RAM, and 1 TB of bandwidth, running Ubuntu 22.04 (Jammy Jellyfish).
I ran a 5-minute load test with 20 virtual users where:
nload (bottom left). Loom 📹 https://www.loom.com/share/ac1fdcca629343fc9472be48acf7c050?sid=6a9e8078-6fc5-4a42-87bb-da3a99638fe7
Colossus runs as a single-threaded Node.js worker process. This means the Node.js runtime uses only one of the 16 available cores at any time (which core is chosen by the OS scheduler, which is why the load shifts across cores in the Loom video). The two main competing functions started at the server entrypoint are:

- `app.listen` — the Express app that accepts HTTP requests (responsible for the incoming network traffic you see in the nload interface)
- `runSyncWithInterval` — periodically (every 1 minute, the current default) executes a sequence of tasks
Without going into too much detail, here is how `runSyncWithInterval` works. The `sleep` and `await performSync` lines ensure that `sleepInterval = 1 minute` elapses between `performSync` executions. This implies that each `performSync` start timestamp is: previous_performSync_start_timestamp + performSync_execution_time + 1 min. Since performSync_execution_time is variable, the prober experiences non-deterministic timeouts over time.
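The scheduling behaviour described above can be sketched as follows; the identifiers mirror the Colossus names, but the bodies are illustrative stand-ins, not the actual implementation:

```javascript
// Illustrative sketch of the sync scheduling (assumed function bodies).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Stand-in for the real sync pass; here it just burns `execMs` of wall time.
async function performSync(execMs) {
  await sleep(execMs);
}

// Each cycle lasts sleepIntervalMs + performSync's own execution time, so a
// start timestamp is: previous start + performSync execution time + interval.
async function runSyncWithInterval(sleepIntervalMs, cycles, execMs) {
  const startTimes = [];
  for (let i = 0; i < cycles; i++) {
    await sleep(sleepIntervalMs);
    startTimes.push(Date.now());
    await performSync(execMs);
  }
  return startTimes;
}
```

Because the interval starts counting only after `performSync` finishes, a slow sync pass pushes the next start later, which is why the timeouts observed by the prober are non-deterministic.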
The `performSync` function compares the local bucket state against the on-chain bucket state (coming from the QN) and prepares a list of `objectsId`s to add and delete. For each new asset to be added, it calls the `getPrepareDownloadTasks` factory method, which sets up the URLs of the other storage nodes from which the assets are fetched (this ensures that assets are propagated between nodes hosting the same bucket). `getPrepareDownloadTasks` eventually causes the `PrepareDownloadTask.execute` async method to run, which internally uses node-cache 5.1.1 to store valid, reliable storage operator URLs (and to mark unreliable ones); the asset file is then downloaded via `DownloadTask.execute`, which should cause the outgoing network traffic to rise.
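A minimal TTL cache along the lines of what node-cache provides for the operator-URL bookkeeping described above might look like this; the class and key names are illustrative, not taken from the Colossus codebase:

```javascript
// Minimal TTL cache sketch approximating the node-cache usage described
// above: remembering which storage operator URLs are currently reliable.
class OperatorUrlCache {
  constructor(ttlMs) {
    this.ttlMs = ttlMs;       // how long an entry stays valid
    this.store = new Map();   // key -> { value, expiresAt }
  }

  set(key, value) {
    this.store.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get(key) {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      // Lazily evict expired entries on lookup.
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

Note that lookups and evictions all run synchronously on the single event-loop thread, which is consistent with the observation below that cache activity coincides with the high single-core load.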
There are segments of the Loom video where the outgoing and incoming network traffic are both high, suggesting that `DownloadTask.execute` does not interfere with the total network bandwidth. What emerges is that during the syncing process one of the 16 cores is under high load, possibly above 80%, and this is observable even when no GET requests are initiated. This suggests that `app.listen` and `runSyncWithInterval` are essentially competing for the same core, and specifically that the bottleneck is caused by `PrepareDownloadTask.execute`. Furthermore, inspecting the logs, the high load appears in conjunction with the cache lookup for valid operator URLs in `nodeCache` (inside `getRemoteDataObjects`), or possibly with the storage operator URL selection as a whole. In any case, the bottleneck appears to lie in the code inside `PrepareDownloadTask.execute`.
I think at this point that the `app.listen` part (i.e. the "REST API process") could be executed on its own thread, which would eliminate the competition for the same core with the syncing routine. I have checked with @zeeshanakram3 and he assured me that CPU resource usage is a known problem in Colossus and a fix is in the pipeline.
Nice work taking these measurements. Which version of Colossus did you test against?
Could you repeat the same test for the latest version, v3.10.0?
I will try @mnaamani, with the blessing of @yasiryagi
If you haven't performed the benchmark yet, there is a v3.10.1 release Docker image already published: joystream/storage-node:3.10.1
I have tried it with the latest Colossus and the problem seems to be gone; however, there's still some issue when the asset file hash is mismatched https://www.loom.com/share/0e2ae18ec8fb410da80b3d37e065731b?sid=fcb2089e-3ab1-4eb1-acdf-48437f85e3af
Nice work. Just to clarify: how many gigahertz per core were used in the test? Were you able to reproduce the 5 s delays? What is the resulting minimal server requirement for an SP in terms of CPU? How do you suggest improving the caching function for future huge numbers of objects?
> I have tried it with the latest Colossus and the problem seems to be gone; however, there's still some issue when the asset file hash is mismatched
Can you share a bit more about the problem you're describing? It's not immediately apparent to me from the Loom video.
1. Add a storage node to the probe and assign it to the test channels only.
2. Enable tracing on the node.
3. Define and document the various storage events, especially the process for GET and POST events.
4. Identify the bottleneck for GET and POST events.
5. Propose a fix.