fuseio / fuse-bridge

MIT License

Oracle containers seem to restart periodically #17

Open Andrew-Pohl opened 4 years ago

Andrew-Pohl commented 4 years ago

I have noticed on my nodes that the oracle containers are periodically restarting themselves due to errors. I have attached an example log from the initiate-change container. It has been observed on the following containers:

• initiate-change
• collected-signatures
• signature-request
• rewarded-on-cycle
• affirmation-request


I did notice that I am hitting the 250MB memory limit on some of the oracle containers, which may be causing the issues.


log.txt
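
For anyone else checking their setup, a quick way to watch the oracle containers' memory against the configured limit is plain `docker stats` (the `oracle` name filter below is just an illustration; adjust it to whatever your compose project names the containers):

```sh
# Stream memory usage vs. limit (and CPU) for the oracle containers only.
# The grep filter is illustrative -- substitute your actual container names.
docker stats --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}\t{{.CPUPerc}}" \
  $(docker ps --format "{{.Names}}" | grep -i oracle)
```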

LiorRabin commented 4 years ago

Need to understand why the memory is hitting that limit - could possibly be a memory leak.

Andrew-Pohl commented 4 years ago

> Need to understand why the memory is hitting that limit - could possibly be a memory leak.

Yeah, it seems like a leak: it starts around 50MB and keeps growing up to the 250MB limit in ~16 hours. I spoke to Rob and he has the same issue on his nodes.

CrackerHax commented 4 years ago

Same here. On the fuseapp I am getting this error:

[2020-05-15 23:40:44.223 +0000] INFO  (1 on 2a7aac20d111): emitRewardedOnCycle
[2020-05-15 23:42:54.175 +0000] ERROR (1 on 2a7aac20d111): Invalid JSON RPC response: ""
    Error: Invalid JSON RPC response: ""
        at Object.InvalidResponse (/node_modules/truffle-hdwallet-provider/dist/index.js:15:637106)
        at t.i.onreadystatechange (/node_modules/truffle-hdwallet-provider/dist/index.js:15:748901)
        at t.e.dispatchEvent (/node_modules/truffle-hdwallet-provider/dist/index.js:1:142373)
        at t._setReadyState (/node_modules/truffle-hdwallet-provider/dist/index.js:15:753708)
        at t._onHttpRequestError (/node_modules/truffle-hdwallet-provider/dist/index.js:15:757376)
        at ClientRequest.<anonymous> (/node_modules/truffle-hdwallet-provider/dist/index.js:15:754873)
        at ClientRequest.emit (events.js:198:13)
        at TLSSocket.socketErrorListener (_http_client.js:392:9)
        at TLSSocket.emit (events.js:198:13)
        at emitErrorNT (internal/streams/destroy.js:91:8)
        at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)
        at process._tickCallback (internal/process/next_tick.js:63:19)

And on other containers I get this:

npm ERR! code ELIFECYCLE
npm ERR! errno 11
npm ERR! native-to-erc20-bridge-oracle@2.0.1 watcher:collected-signatures: ./scripts/start-worker.sh watcher collected-signatures-watcher
npm ERR! Exit status 11
npm ERR!
npm ERR! Failed at the native-to-erc20-bridge-oracle@2.0.1 watcher:collected-signatures script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-05-15T23_50_10_352Z-debug.log
{"level":60,"time":1589585661282,"msg":"Max processing time reached","name":"watcher-native-erc-rewarded-on-cycle","v":1}
npm ERR! code ELIFECYCLE
npm ERR! errno 11
npm ERR! native-to-erc20-bridge-oracle@2.0.1 watcher:rewarded-on-cycle: ./scripts/start-worker.sh watcher rewarded-on-cycle-watcher
npm ERR! Exit status 11
npm ERR!
npm ERR! Failed at the native-to-erc20-bridge-oracle@2.0.1 watcher:rewarded-on-cycle script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-05-15T23_34_21_464Z-debug.log

> native-to-erc20-bridge-oracle@2.0.1 watcher:rewarded-on-cycle /bridge-oracle
> ./scripts/start-worker.sh watcher rewarded-on-cycle-watcher

Andrew-Pohl commented 4 years ago

Disabling THP seems to fix the restarting issues. Tested by myself and Rob for ~48 hrs combined. The memory usage increases and gets freed close to the limit without causing the containers to restart.
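
For reference, disabling THP on the host is roughly the following (a sketch only; the exact change is in the PRs linked below, and persisting it across reboots needs a boot parameter or startup script):

```sh
# Disable transparent huge pages until the next reboot (run as root on the host).
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

# Verify -- the active setting is shown in [brackets].
cat /sys/kernel/mm/transparent_hugepage/enabled
```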

I have added some changes to the compose file and quickstart to fix these issues and put in PRs here: https://github.com/fuseio/fuse-bridge/pull/18 https://github.com/fuseio/fuse-network/pull/36

Probably requires some more time to test, but initial signs look promising! :)

Andrew-Pohl commented 4 years ago

OK, I looked at this further today and last night, and added some more log traces to the containers to figure out exactly what is going on.

The cause of the issue is that watcher.js tries to add data to the amqp queue, but the container has run out of memory. Erlang then reshuffles the amqp container (either by dropping slots or writing to disk) to free up space; this slows the container down and trips the watchdog check. The underlying reason is that the oracle containers are only given 10% CPU access; on my machine, upping this to 15% stopped the watchdog from triggering, so I have upped it to 25% in git for safety.
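
If anyone wants to experiment with the cap before pulling the compose change, the limit can also be bumped on a running container (assuming a Docker version that supports `docker update --cpus`; the container name here is illustrative, and 0.25 corresponds to the 25% figure above):

```sh
# Raise the CPU cap of a running oracle container to a quarter of one core.
# Check `docker ps` for the real container name in your setup.
docker update --cpus="0.25" fuse_oracle_1
```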

The memory increase is not a leak; it's just rabbit (amqp) filling its buffers, queues and databases, which is expected behavior. GC gets triggered when the container is OOMed (or close to it), which clears some memory.
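
One way to sanity-check that the memory really is sitting inside rabbit rather than the watcher process (assuming the standard rabbitmqctl tooling is available in the amqp container; the container name is illustrative):

```sh
# Print the RabbitMQ node's own memory breakdown from inside the amqp container.
docker exec fuse_amqp_1 rabbitmqctl status | grep -A 20 -i memory
```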

Tested on a KVM with 3 vCPUs (Xeon E3-1231v3) with a 15% cap on the oracle containers:

• Run time as of writing: 40 hrs, no container errors or restarts
• RAM usage: 900 MB at start, rising to 1.6-1.8 GB after ~12 hrs (stays there for the duration)
• Swap usage: not checked at start, but stabilizes around 1.3 GB
• 15-minute load average: ~0.44 (~15% CPU usage)


I will continue to monitor, but previously the containers on my machine were consistently dying after 8-16 hrs.

Obviously this 25% may need to be adjusted depending on a validator's system specs, so it might be worth mentioning somewhere in the docs and reviewing at a later date :)

Andrew-Pohl commented 4 years ago

@LiorRabin I just noticed that Docker's CPU limit is relative per core, i.e. if two cores are present the maximum is 2.0 (it doesn't scale to 1.0). This means that anything beyond one core is currently completely wasted! I can modify the quickstart to scale the number depending on how many CPU cores are present, if needed?
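
A rough sketch of how that scaling could work in the quickstart (the 25% fraction and the variable name are placeholders, not the final change):

```sh
# Scale the per-container CPU limit with the number of host cores,
# so multi-core machines aren't capped at a fraction of a single core.
CORES=$(nproc)
CPU_LIMIT=$(awk -v c="$CORES" 'BEGIN { printf "%.2f", c * 0.25 }')
echo "Setting oracle CPU limit to ${CPU_LIMIT} (of ${CORES} cores)"
# Hypothetical variable consumed by the compose file / quickstart:
export ORACLE_CPU_LIMIT="$CPU_LIMIT"
```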

LiorRabin commented 4 years ago

@Andrew-Pohl Go for it :)