LiskArchive / lisk-sdk

🔩 Lisk software development kit
https://lisk.com
Apache License 2.0

Lisk Core explodes when used in conjunction with Lisk Service #9187

Closed Lemii closed 6 months ago

Lemii commented 6 months ago

It is currently not possible to run a stable mainnet Lisk Core + Lisk Service server.

Expected behavior

Lisk Core and Lisk Service should run stably side by side, without problems.

Actual behavior

When running Core and LS together, at some point the whole server becomes slow and unresponsive (even though CPU and memory usage do not appear high... yet). From that point on, Core's memory usage keeps increasing by hundreds of megabytes every few seconds. Even if you stop all processes except Core, the memory keeps rising. This continues until the server runs out of memory and all PM2 processes are forcefully stopped. There are no additional error logs.

After the explosion, PM2 restarts all processes that were saved in the pm2 save configuration.

This issue does not seem to occur when running Core without LS.

The issue also occurs while indexing.

Steps to reproduce

Which version(s) does this affect? (Environment, OS, etc...)

Lemii commented 6 months ago

After creating this issue, I removed all pm2 processes using the pm2 delete command. Afterwards, I restarted them one by one. It's now a few hours later, and no explosion has occurred thus far.

Perhaps the problem was that a process with an old version was saved in the pm2 startup config, and deleting the processes (rather than restarting them) forced pm2 to use the new version?

Either way, I see v0.7.2 has been tagged. I'll try that branch, re-sync and re-index the node from scratch, and will report back if I run into more problems.

Lemii commented 6 months ago

Issues started arising again soon after my last update. It was a battle-worn machine, having gone through multiple OS upgrades, a network migration, and an overall rescale to bigger specs. I gave up and started on a fresh machine with v0.7.2.

Syncing went super quick. Performance appeared stable afterwards. But then, all of a sudden, it crashed just now with the following error:

0|lisk-core                             | <--- Last few GCs --->
0|lisk-core                             | [4010:0x6afb870] 13557675 ms: Scavenge 2036.3 (2051.6) -> 2036.0 (2062.6) MB, 31.5 / 0.0 ms  (average mu = 0.817, current mu = 0.820) allocation failure;
0|lisk-core                             | [4010:0x6afb870] 13557903 ms: Mark-sweep (reduce) 2042.8 (2062.6) -> 2041.9 (2055.6) MB, 65.8 / 0.0 ms  (+ 116.4 ms in 97 steps since start of marking, biggest step 111.6 ms, walltime since start of marking 215 ms) (average mu = 0.681, current mu = 0.408)
0|lisk-core                             | <--- JS stacktrace --->
0|lisk-core                             | FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory

I haven't seen this particular error log before. When my previous node would crash, it would just kill all processes without outputting any logs.
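
For anyone comparing similar crashes, here is a minimal sketch that pulls the heap figures out of a "Last few GCs" trace line like the ones above. `parseGcLine` and `GcSample` are hypothetical names, not part of Lisk Core or Node.js, and the field meanings are an assumption based on how the V8 trace is commonly read (object size, then total memory in parentheses).

```typescript
// Hypothetical helper, not part of Lisk Core or Node.js.
// Assumption: the trace reads "before (total before) -> after (total after) MB".
interface GcSample {
  kind: string; // e.g. "Scavenge" or "Mark-sweep (reduce)"
  beforeMb: number;
  beforeTotalMb: number;
  afterMb: number;
  afterTotalMb: number;
}

function parseGcLine(line: string): GcSample | null {
  // Matches "... ms: <kind> <before> (<total>) -> <after> (<total>) MB ..."
  const m =
    /ms:\s+(.+?)\s+([\d.]+)\s+\(([\d.]+)\)\s+->\s+([\d.]+)\s+\(([\d.]+)\)\s+MB/.exec(line);
  if (m === null) return null;
  const [, kind = "", before = "", beforeTotal = "", after = "", afterTotal = ""] = m;
  return {
    kind,
    beforeMb: Number(before),
    beforeTotalMb: Number(beforeTotal),
    afterMb: Number(after),
    afterTotalMb: Number(afterTotal),
  };
}
```

Run against the two trace lines above, both samples sit just above 2,000 MB before and after collection, i.e. the GC is reclaiming almost nothing right before the fatal error.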

Lemii commented 6 months ago

FYI: The new server has survived the night and is running stable. There were no restarts or crashes. With all processes combined it is now hovering around 3.5 GB memory usage (1.4 GB of that being Lisk Core). The previous heap limit crash might've been a weird edge case.

Lemii commented 6 months ago

Back with another update. Things have been relatively stable, but far from perfect. There have been no major crashes, but the LS HTTP API does become slow and unresponsive every few hours (I use it with one of my private tools, and every once in a while it simply can't connect). When this happens, LS stops indexing and the logs fill up with 'endpoint x not available' messages (example below). I'm not sure how it gets resolved, but I suppose some process restarts and everything is alright again.

To ensure stable performance in these high traffic times, I will delete and restart all processes every 2 or 3 hours for the time being. Not perfect, but workable.

Some example logs:

34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:00 769: 2023-12-14T19:21:00.769 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:00 770: 2023-12-14T19:21:00.770 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:01 358: 2023-12-14T19:21:01.357 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:01 358: 2023-12-14T19:21:01.358 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:01 670: 2023-12-14T19:21:01.670 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:01 671: 2023-12-14T19:21:01.670 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:02 211: 2023-12-14T19:21:02.211 INFO [indexStatus] currentChainHeight: 23462681, lastIndexedBlockHeight: 23462676
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:02 212: 2023-12-14T19:21:02.211 INFO [indexStatus] Block index status: 71685/71690 blocks indexed (99.99%).
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:02 986: 2023-12-14T19:21:02.986 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:02 987: 2023-12-14T19:21:02.986 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:03 573: 2023-12-14T19:21:03.572 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:03 573: 2023-12-14T19:21:03.573 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:03 863: 2023-12-14T19:21:03.862 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:03 864: 2023-12-14T19:21:03.863 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:05 128: 2023-12-14T19:21:05.127 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:05 128: 2023-12-14T19:21:05.128 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:05 740: 2023-12-14T19:21:05.740 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:05 741: 2023-12-14T19:21:05.740 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:06 034: 2023-12-14T19:21:06.034 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:06 034: 2023-12-14T19:21:06.034 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:06 829: 2023-12-14T19:21:06.829 INFO [accountBalanceIndex] Successfully updated account balances for 0 account(s).
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:06 829: 2023-12-14T19:21:06.829 INFO [updateAccounts] Triggered account balance updates successfully.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:07 319: 2023-12-14T19:21:07.319 WARN [BROKER] Service 'connector.getNetworkStatus' is not available.
34|lisk-service-blockchain-indexer       | 2023-12-14 19:21:07 319: 2023-12-14T19:21:07.319 ERROR [microservice] Error occurred! Service 'connector.getNetworkStatus' is not available.
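
To get a quick feel for how often this happens, a small sketch along these lines could tally the broker warnings per service from a saved log file. `countUnavailable` is a hypothetical helper, not part of Lisk Service, and it assumes the log lines keep the exact `WARN [BROKER] Service '...' is not available.` wording seen above.

```typescript
// Hypothetical helper, not part of Lisk Service.
// Counts "[BROKER] Service '<name>' is not available." warnings per service.
// Only WARN lines are counted, so the matching ERROR line that echoes each
// warning is not counted a second time.
function countUnavailable(log: string): Map<string, number> {
  const counts = new Map<string, number>();
  const pattern = /WARN \[BROKER\] Service '([^']+)' is not available\./;
  for (const line of log.split("\n")) {
    const m = pattern.exec(line);
    if (m === null) continue;
    const service = m[1] ?? "";
    counts.set(service, (counts.get(service) ?? 0) + 1);
  }
  return counts;
}
```

Feeding it the excerpt above would report a single service, `connector.getNetworkStatus`, which suggests the indexer lost contact with the connector rather than with Core directly.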

shuse2 commented 6 months ago

This is solved in https://github.com/LiskHQ/lisk-service/releases/tag/v0.7.3