IntersectMBO / cardano-node

The core component that is used to participate in a Cardano decentralised blockchain.
https://cardano.org
Apache License 2.0
3.06k stars 721 forks source link

[BUG] - Memory usage out of control, node crashes #5323

Closed Titan-C closed 1 year ago

Titan-C commented 1 year ago

External Other

Summary Node crashes during regular operation and I can't sync to the network

Steps to reproduce Steps to reproduce the behavior:

  1. Using hydra binaries 8.0 or compiling the node locally run the node on mainnet
  2. let it run
  3. I runs out of memory and crashes

Expected behavior The node keeps working without crash

System info (please complete the following information):

Screenshots and attachments this is the systemd logfile

Jun 15 02:40:51 darktitan cardano-node[3030]: [darktita:cardano.node.ChainDB:Notice:338] [2023-06-15 00:40:51.65 UTC] Chain extended, new tip: 0240cd67362721132b6a37d67eefb3aac7b6e8c5651967db8880946e5c7ee008 at slot 72316443
Jun 15 02:40:52 darktitan cardano-node[3030]: [darktita:cardano.node.ChainDB:Notice:338] [2023-06-15 00:40:52.94 UTC] Chain extended, new tip: 16c94ac2ff9a1122140d6043d984ba99cb7dd82cbdec50380b9e34036c978ee2 at slot 72316788
Jun 15 02:42:37 darktitan cardano-node[3030]: cardano-node: internal error: Unable to commit 1048576 bytes of memory
Jun 15 02:42:37 darktitan cardano-node[3030]:     (GHC version 8.10.7 for x86_64_unknown_linux)
Jun 15 02:42:37 darktitan cardano-node[3030]:     Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
Jun 15 02:43:50 darktitan systemd[1]: cardano-node-mainnet.service: Main process exited, code=dumped, status=6/ABRT
Jun 15 02:43:50 darktitan systemd[1]: cardano-node-mainnet.service: Failed with result 'core-dump'.
Jun 15 02:43:50 darktitan systemd[1]: cardano-node-mainnet.service: Consumed 10h 49min 1.068s CPU time.
Jun 15 02:44:50 darktitan systemd[1]: cardano-node-mainnet.service: Scheduled restart job, restart counter is at 1.
Jun 15 02:44:50 darktitan systemd[1]: Stopped Cardano node Mainnet.
Jun 15 02:44:50 darktitan systemd[1]: cardano-node-mainnet.service: Consumed 10h 49min 1.068s CPU time.

After this crash it takes a long while to to validate again the DB, well it says validating chunks. Then it replays blocks goes back to syncing and crashes again

Jun 15 03:16:11 darktitan cardano-node[22403]: [darktita:cardano.node.ChainDB:Notice:1875] [2023-06-15 01:16:11.18 UTC] Chain extended, new tip: 32b7167f07d88301767f917a01359e8932026629f2dc833280a817bce52f47ff at slot 72326379
Jun 15 03:16:12 darktitan cardano-node[22403]: [darktita:cardano.node.ChainDB:Notice:1875] [2023-06-15 01:16:12.43 UTC] Chain extended, new tip: 325105b8d1d9eeba1510baa4c3650ce8a73b959e9b6dcdaef081a754a3434a89 at slot 72326901
Jun 15 03:16:13 darktitan cardano-node[22403]: [darktita:cardano.node.ChainDB:Notice:1875] [2023-06-15 01:16:13.69 UTC] Chain extended, new tip: 179524c8440abf6aafb4b49e89fbc546f777f8f25d840dc1d7389c96e0aa0795 at slot 72327447
Jun 15 03:16:16 darktitan cardano-node[22403]: cardano-node: internal error: Unable to commit 1048576 bytes of memory
Jun 15 03:16:16 darktitan cardano-node[22403]:     (GHC version 8.10.7 for x86_64_unknown_linux)
Jun 15 03:16:16 darktitan cardano-node[22403]:     Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug
Jun 15 03:17:30 darktitan systemd[1]: cardano-node-mainnet.service: Main process exited, code=dumped, status=6/ABRT
Jun 15 03:17:30 darktitan systemd[1]: cardano-node-mainnet.service: Failed with result 'core-dump'.
Jun 15 03:17:30 darktitan systemd[1]: cardano-node-mainnet.service: Consumed 33min 12.853s CPU time.
Jun 15 03:17:41 darktitan systemd[1]: Stopped Cardano node Mainnet.
Jun 15 03:17:41 darktitan systemd[1]: cardano-node-mainnet.service: Consumed 33min 12.853s CPU time.

This is my monitoring log with netdata. You see that the cardano-node-mainnet systemd service used 200% cpu, that is two cores. I'm still syncing to the network and the memory usage seems to grow linearly. You see the first crash at 2:43 that clears all memory use. After the immediate restart it uses 100% CPU(one core) and memory grows rapidly up to a flat max, which is higher than the previous max memory used before the crash. The instant when the memory usage reached this max
corresponds in the log to

Jun 15 03:04:10 darktitan cardano-node[22403]: [darktita:cardano.node.ChainDB:Info:5] [2023-06-15 01:04:10.91 UTC] Validated chunk no. 2763 out of 3345. Progress: 82.60%
Jun 15 03:04:10 darktitan cardano-node[22403]: [darktita:cardano.node.ChainDB:Info:5] [2023-06-15 01:04:10.91 UTC] Validating chunk no. 2764 out of 3345. Progress: 82.60%

cardano-node-crash

Additional context I run the node on a desktop computer with 64GB of ram. Given that your current recommendation for the node is 24GB ram, I should have more than enough. Yet what netdata monitoring shows is that I used 42.2 GiB before the first crash and reached 53.5 GiB before the second crash after which I stopped the service.

CarlosLopezDeLara commented 1 year ago

Cannot reproduce.

@Titan-C can you run ls -l /proc/${thenodepid}/fd before it crashes, please?

Titan-C commented 1 year ago

Same as last time netdata monitoring shows a very rapid increase in RAM usage during the validation of chunks, it is certainly all just filling up the cache from reading the files on disk. It fills up the ram saturating at 52.6 GiB, that is all free ram that is not used by other OS processes. Once that cache is full it stays stable ram usage. It completes the validation of chunks, then it replays some blocks, does the chain selection, pushes some ledger state, finally picks a tip and starts extending the chain. It manages to extend the chain 16 times and then crashes. At that point netdata shows a drop in RAM usage from the 52.6GiB to 40.4GiB, that is the time during the core dump.

This is the last moments while operation is normal.

sudo ls -l /proc/5049/fd
total 0
lr-x------ 1 cardano cardano 64 Jun 23 12:09 0 -> /dev/null
lrwx------ 1 cardano cardano 64 Jun 23 12:09 1 -> 'socket:[66586]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 10 -> 'pipe:[58084]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 11 -> 'anon_inode:[eventfd]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 12 -> 'pipe:[58085]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 13 -> 'pipe:[58085]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 14 -> 'anon_inode:[eventfd]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 15 -> 'socket:[66586]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 16 -> 'socket:[61971]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 17 -> 'socket:[61972]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 18 -> /var/lib/cardano-node/db-mainnet/lock
l-wx------ 1 cardano cardano 64 Jun 23 12:09 19 -> /var/lib/cardano-node/db-mainnet/immutable/03345.chunk
lrwx------ 1 cardano cardano 64 Jun 23 12:09 2 -> 'socket:[66586]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 20 -> /var/lib/cardano-node/db-mainnet/immutable/03345.primary
l-wx------ 1 cardano cardano 64 Jun 23 12:09 21 -> /var/lib/cardano-node/db-mainnet/immutable/03345.secondary
l-wx------ 1 cardano cardano 64 Jun 23 12:09 22 -> /var/lib/cardano-node/db-mainnet/volatile/blocks-7792.dat
lrwx------ 1 cardano cardano 64 Jun 23 11:39 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 4 -> 'anon_inode:[timerfd]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 5 -> 'pipe:[58081]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 6 -> 'pipe:[58081]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 7 -> 'anon_inode:[eventfd]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 8 -> 'anon_inode:[eventpoll]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 9 -> 'pipe:[58084]'
free
               total        used        free      shared  buff/cache   available
Mem:        61603324    13402740      521832      174696    47678752    47336800
Swap:              0           0           0

Then there is the crash or just before crash.(I can't precisely assign this a time, was manually launching the command, don't have a timer on my shell)

sudo ls -l /proc/5049/fd
total 0
lr-x------ 1 cardano cardano 64 Jun 23 12:09 0 -> /dev/null
lrwx------ 1 cardano cardano 64 Jun 23 12:09 1 -> 'socket:[66586]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 10 -> 'pipe:[58084]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 11 -> 'anon_inode:[eventfd]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 12 -> 'pipe:[58085]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 13 -> 'pipe:[58085]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 14 -> 'anon_inode:[eventfd]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 15 -> 'socket:[66586]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 16 -> 'socket:[61971]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 17 -> 'socket:[61972]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 18 -> /var/lib/cardano-node/db-mainnet/lock
l-wx------ 1 cardano cardano 64 Jun 23 12:09 19 -> /var/lib/cardano-node/db-mainnet/immutable/03345.chunk
lrwx------ 1 cardano cardano 64 Jun 23 12:09 2 -> 'socket:[66586]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 20 -> /var/lib/cardano-node/db-mainnet/immutable/03345.primary
l-wx------ 1 cardano cardano 64 Jun 23 12:09 21 -> /var/lib/cardano-node/db-mainnet/immutable/03345.secondary
l-wx------ 1 cardano cardano 64 Jun 23 12:09 22 -> /var/lib/cardano-node/db-mainnet/volatile/blocks-7792.dat
lrwx------ 1 cardano cardano 64 Jun 23 12:09 23 -> /var/lib/cardano-node/db-mainnet/ledger/72283294
lr-x------ 1 cardano cardano 64 Jun 23 12:14 24 -> /proc/5049/stat
lrwx------ 1 cardano cardano 64 Jun 23 12:14 25 -> 'socket:[99404]'
lrwx------ 1 cardano cardano 64 Jun 23 12:14 26 -> 'socket:[77687]'
lrwx------ 1 cardano cardano 64 Jun 23 12:14 27 -> 'socket:[75716]'
lrwx------ 1 cardano cardano 64 Jun 23 12:14 28 -> 'socket:[99405]'
lrwx------ 1 cardano cardano 64 Jun 23 12:14 29 -> 'socket:[103426]'
lrwx------ 1 cardano cardano 64 Jun 23 11:39 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 4 -> 'anon_inode:[timerfd]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 5 -> 'pipe:[58081]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 6 -> 'pipe:[58081]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 7 -> 'anon_inode:[eventfd]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 8 -> 'anon_inode:[eventpoll]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 9 -> 'pipe:[58084]'

After the crash during the core-dump the line of number 24 -> /proc/5049/stat disappears.

free
               total        used        free      shared  buff/cache   available
Mem:        61603324    15384844      505364      115644    45713116    45413900
Swap:              0           0           0

The process timeline is 11:39 start node 12:14:24 node crashes 12:15:56 core-dump finished systemd service ends and a new one is restarted

rdlrt commented 1 year ago

Are you sure this isn't an oversubscribed or clustered storage with aggressive memory cache settings?

The period you mention is when IO activity (not necessarily throughput) will be high and bad cache settings for clustered storage can exhaust your memory easily

CarlosLopezDeLara commented 1 year ago

Indeed that cache does not seem right. I was looking for a crazy amount of open files, but it looks normal. I assume you have a script to start the node. How it looks like?

Titan-C commented 1 year ago

This is a desktop machine, I haven't changed Archlinux defaults regarding cache settings. No idea what clustered storage is. I have a nvme disk. Yes the memory cache is high, but in general it is ok in linux to cache in ram files read. And after cache is saturated it is still validating chunks. The crash happens after all the validations have completed and the node is growing the chain with new blocks.

I start the node directly with systemd. This is the service file

[Unit]
Description=Cardano node Mainnet
Requires=network.target

[Service]
Type=simple
Environment=LD_LIBRARY_PATH=/opt/local/lib:
Environment=PKG_CONFIG_PATH=/opt/local/lib/pkgconfig:
ExecStartPre=/usr/bin/env
ExecStart=cardano-node run \
    --database-path "/var/lib/cardano-node/db-mainnet" \
    --socket-path "/run/cardano-node-mainnet/socket" \
    --port 3001 \
    --config "/var/lib/cardano-node/config/mainnet/config.json"  \
    --topology "/var/lib/cardano-node/config/mainnet/topology.json"
KillSignal=SIGINT
Restart=always
RestartSec=60
TimeoutStopSec=600
LimitNOFILE=32768

# Directory creation and permissions
####################################

# Run as cardano:cardano
User=cardano
Group=cardano

# /run/cardano
RuntimeDirectory=cardano-node-mainnet
RuntimeDirectoryMode=0710

# /var/lib/cardano
StateDirectory=cardano-node
StateDirectoryMode=0710

[Install]
WantedBy=multi-user.target
Titan-C commented 1 year ago

I just updated to the cardano-node 8.1.1 using the binary from the github release. I still have the same problem. After incrementing the tip by 15 blocks, the entire node crashes.

coot commented 1 year ago

To how many nodes the node is trying to connect to?

Titan-C commented 1 year ago

@coot It uses the default topology file, thus it only tries to connect to the IOHK relay.

Titan-C commented 1 year ago

I don't know why it makes a difference but I created a swapfile and loaded it. Now it works. The node does consume all RAM in cache, but it is mostly cache. The node itself in operation uses 13.4 GB, while the systemd process including the cache does indicate 52GB. The swap partition only used 256K, but the node hasn't crashed yet. Since previously it crashed after 30 after re-validating chunks and adding 16 blocks, now it has been running for 2 hours and it looks stable. I already synced to the Babbage era.

coot commented 1 year ago

I am glad you found a fix which worked for you. How much resident RAM does it consume after it synced?

Titan-C commented 1 year ago

From gLiveView:

│- NODE RESOURCE USAGE ------------------------------------------------ │ CPU (sys) : 1.21% Mem (Live) : 5.8G GC Minor : 51651 │ Mem (RSS) : 13.1G Mem (Heap) : 13.0G GC Major : 32

coot commented 1 year ago

It all looks good now. I'll close the issue, please re-open or open a new one if you encounter something similar in the future.