Joystream / joystream

Joystream Monorepo
http://www.joystream.org
GNU General Public License v3.0

Investigate Argus crash on 2024-03-21 #5111

Closed ignazio-bovo closed 5 months ago

ignazio-bovo commented 7 months ago

TLDR

A ping at 7:40 am CET on 2024-03-21 revealed that multiple nodes had crashed.


kdembler commented 7 months ago
2024-03-21 06:41:38:4138 StorageNodeApi error: Request timeout of 5000ms reached
{
    "0": {
        "endpoint": "https://sieemmastorage.com/storage/api/v1"
    },
    "timeoutMs": 5000,
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}
2024-03-21 06:41:38:4138 StorageNodeApi error: Unexpected error while requesting data object
{
    "0": {
        "endpoint": "https://sieemmastorage.com/storage/api/v1"
    },
    "objectId": "2513914",
    "err": {
        "message": "Request timeout"
    },
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}
2024-03-21 06:41:38:4138 NetworkingManager error: Data object download failed
{
    "err": {
        "message": "Failed to download object 2513914 from any availablable storage provider",
        "stack": "Error: Failed to download object 2513914 from any availablable storage provider\n    at fail (/joystream/distributor-node/lib/services/networking/NetworkingService.js:224:24)\n    at Queue.<anonymous> (/joystream/distributor-node/lib/services/networking/NetworkingService.js:265:21)\n    at Queue.emit (node:events:517:28)\n    at Queue.done (/joystream/node_modules/queue/index.js:194:8)\n    at next (/joystream/node_modules/queue/index.js:118:16)\n    at /joystream/node_modules/queue/index.js:150:14\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at runNextTicks (node:internal/process/task_queues:64:3)\n    at listOnTimeout (node:internal/timers:538:9)\n    at process.processTimers (node:internal/timers:512:7)"
    },
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}
2024-03-21 06:41:38:4138 PublicApi error: middlewareError
{
    "err": {
        "message": "Failed to download object 2513914 from any availablable storage provider",
        "stack": "Error: Failed to download object 2513914 from any availablable storage provider\n    at fail (/joystream/distributor-node/lib/services/networking/NetworkingService.js:223:25)\n    at Queue.<anonymous> (/joystream/distributor-node/lib/services/networking/NetworkingService.js:265:21)\n    at Queue.emit (node:events:517:28)\n    at Queue.done (/joystream/node_modules/queue/index.js:194:8)\n    at next (/joystream/node_modules/queue/index.js:118:16)\n    at /joystream/node_modules/queue/index.js:150:14\n    at processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at runNextTicks (node:internal/process/task_queues:64:3)\n    at listOnTimeout (node:internal/timers:538:9)\n    at process.processTimers (node:internal/timers:512:7)"
    },
    "req": {
        "url": "/api/v1/assets/2513914",
        "method": "GET",
        "httpVersion": "1.1",
        "originalUrl": "/api/v1/assets/2513914",
        "query": {}
    },
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}
2024-03-21 06:41:38:4138 PublicApi http: HTTP GET /api/v1/assets/2513914
{
    "meta": {},
    "trace_id": "e189941ce41181ed61025e9e07b8e34c",
    "span_id": "0a86f2d94e50bc5e",
    "trace_flags": "01"
}

<--- Last few GCs --->

[7:0x57ad870] 121787624 ms: Mark-sweep 4042.3 (4129.7) -> 4038.5 (4126.1) MB, 1341.4 / 0.0 ms  (average mu = 0.242, current mu = 0.079) allocation failure; scavenge might not succeed
[7:0x57ad870] 121789771 ms: Mark-sweep 4056.1 (4127.9) -> 4054.2 (4157.8) MB, 2132.2 / 0.0 ms  (average mu = 0.139, current mu = 0.007) allocation failure; scavenge might not succeed

<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0xb95b60 node::Abort() [node]
 2: 0xa9a7f8  [node]
 3: 0xd6f2f0 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
 4: 0xd6f697 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
 5: 0xf4cba5  [node]
 6: 0xf5f08d v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
 7: 0xf3978e v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 8: 0xf3ab57 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
 9: 0xf1bd2a v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
10: 0x12e114f v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
11: 0x170deb9  [node]
/joystream/distributor-node/runner.sh: line 8:     7 Aborted                 (core dumped) node --require @joystream/opentelemetry ./bin/run $*
Loaded Application Instrumentation: "Distributor Node"
Starting tracing..

There are hundreds of log entries within the exact same second, all reporting a failure to download an object. I think there may be an infinite loop or recursion somewhere in that path, and the process simply runs out of memory.
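
One way to quantify the burst described above is to bucket the error entries by second. The following is a minimal sketch, assuming the header format seen in the pasted log (`YYYY-MM-DD HH:MM:SS:<digits> <Component> <level>: <message>`); `countErrorsPerSecond` and `findBursts` are hypothetical helper names and are not part of the Argus code.

```typescript
// Matches the first line of an entry such as:
// "2024-03-21 06:41:38:4138 NetworkingManager error: Data object download failed"
// Group 1: timestamp truncated to the second, group 2: component, group 3: level.
const ENTRY_HEADER = /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):\d+ (\S+) (\w+): /;

// Hypothetical helper: counts "error"-level entries per one-second bucket.
// Lines that are not entry headers (JSON payloads, stack traces) are skipped.
function countErrorsPerSecond(logText: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of logText.split("\n")) {
    const match = ENTRY_HEADER.exec(line);
    if (match === null || match[3] !== "error") continue;
    const second = match[1];
    counts.set(second, (counts.get(second) ?? 0) + 1);
  }
  return counts;
}

// Hypothetical helper: returns the buckets whose error count reaches
// `threshold`, busiest first.
function findBursts(
  counts: Map<string, number>,
  threshold: number
): Array<[string, number]> {
  return Array.from(counts.entries())
    .filter(([, n]) => n >= threshold)
    .sort((a, b) => b[1] - a[1]);
}
```

Running `findBursts(countErrorsPerSecond(logText), 100)` over the full log would list the seconds with at least 100 error entries. A single dominant bucket just before the abort would be consistent with the loop hypothesis, although it would not prove it.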

ignazio-bovo commented 7 months ago

This looks very hard to reproduce. I also suspect that the download is failing because there is insufficient heap space for the file to be held in memory before it is saved to disk (or something along those lines), so the error might be somewhere else, as pointed out by Klaudiusz. I would leave this issue open and, if the error recurs often (at least once per week), proceed with a proper investigation. I won't do anything in the meantime, as this looks very time-consuming to reproduce and I am not the one who wrote the Argus code. Let me know what you think @kdembler @zeeshanakram3
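
To test the heap-space hypothesis without a full reproduction, heap usage could be tracked against the V8 limit. The following is a minimal sketch; `HeapStats`, `heapUsageRatio`, `canAcceptDownload` and the 0.8 threshold are all hypothetical and do not come from the Argus code.

```typescript
// Subset of the object returned by Node's v8.getHeapStatistics();
// both values are in bytes.
interface HeapStats {
  used_heap_size: number;
  heap_size_limit: number;
}

// Hypothetical helper: fraction of the V8 heap limit currently in use (0..1).
function heapUsageRatio(stats: HeapStats): number {
  return stats.used_heap_size / stats.heap_size_limit;
}

// Hypothetical policy: signal that new downloads should be deferred once
// heap usage reaches `maxRatio` (0.8 is an arbitrary illustrative value).
function canAcceptDownload(stats: HeapStats, maxRatio: number = 0.8): boolean {
  return heapUsageRatio(stats) < maxRatio;
}
```

In a Node.js process the input could come from `v8.getHeapStatistics()` in the `node:v8` module, whose result includes `used_heap_size` and `heap_size_limit`. Logging the ratio periodically would show whether memory climbs steadily or spikes only during the burst of failed downloads.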