Titan-C closed this issue 1 year ago.
Cannot reproduce.
@Titan-C can you run ls -l /proc/${thenodepid}/fd
before it crashes, please?
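For reference, a quick way to check whether a process is leaking descriptors is to count the entries under /proc/<pid>/fd and compare the count with the process limit. A minimal sketch (the current shell's PID "$$" is used only so the snippet is runnable standalone; substitute the cardano-node PID):

```shell
# Count open file descriptors and show the fd limit for a process.
# Substitute the cardano-node PID for "$$" (the current shell's PID,
# used here only so the snippet runs as-is).
pid="$$"
echo "open fds: $(ls /proc/"$pid"/fd | wc -l)"
grep 'Max open files' /proc/"$pid"/limits
```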
Same as last time: netdata monitoring shows a very rapid increase in RAM usage during the validation of chunks; it is certainly all just the page cache filling up from reading the files on disk. RAM usage saturates at 52.6 GiB, which is all the free RAM not used by other OS processes. Once that cache is full, RAM usage stays stable. The node completes the validation of chunks, then replays some blocks, does the chain selection, pushes some ledger state, and finally picks a tip and starts extending the chain. It manages to extend the chain 16 times and then crashes. At that point netdata shows a drop in RAM usage from 52.6 GiB to 40.4 GiB; that is during the core dump.
These are the last moments while operation is still normal.
sudo ls -l /proc/5049/fd
total 0
lr-x------ 1 cardano cardano 64 Jun 23 12:09 0 -> /dev/null
lrwx------ 1 cardano cardano 64 Jun 23 12:09 1 -> 'socket:[66586]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 10 -> 'pipe:[58084]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 11 -> 'anon_inode:[eventfd]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 12 -> 'pipe:[58085]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 13 -> 'pipe:[58085]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 14 -> 'anon_inode:[eventfd]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 15 -> 'socket:[66586]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 16 -> 'socket:[61971]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 17 -> 'socket:[61972]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 18 -> /var/lib/cardano-node/db-mainnet/lock
l-wx------ 1 cardano cardano 64 Jun 23 12:09 19 -> /var/lib/cardano-node/db-mainnet/immutable/03345.chunk
lrwx------ 1 cardano cardano 64 Jun 23 12:09 2 -> 'socket:[66586]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 20 -> /var/lib/cardano-node/db-mainnet/immutable/03345.primary
l-wx------ 1 cardano cardano 64 Jun 23 12:09 21 -> /var/lib/cardano-node/db-mainnet/immutable/03345.secondary
l-wx------ 1 cardano cardano 64 Jun 23 12:09 22 -> /var/lib/cardano-node/db-mainnet/volatile/blocks-7792.dat
lrwx------ 1 cardano cardano 64 Jun 23 11:39 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 4 -> 'anon_inode:[timerfd]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 5 -> 'pipe:[58081]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 6 -> 'pipe:[58081]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 7 -> 'anon_inode:[eventfd]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 8 -> 'anon_inode:[eventpoll]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 9 -> 'pipe:[58084]'
free
total used free shared buff/cache available
Mem: 61603324 13402740 521832 174696 47678752 47336800
Swap: 0 0 0
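Worth noting: most of that buff/cache figure is reclaimable page cache, and the kernel's MemAvailable estimate already accounts for it. A small sketch to pull the relevant fields straight from /proc/meminfo:

```shell
# Page cache is counted in buff/cache but is reclaimable; MemAvailable is
# the kernel's estimate of memory allocatable without swapping.
grep -E '^(MemTotal|MemAvailable|Cached):' /proc/meminfo
```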
Then there is the crash, or just before the crash. (I can't precisely assign a time to this; I was launching the command manually and don't have a timer on my shell.)
sudo ls -l /proc/5049/fd
total 0
lr-x------ 1 cardano cardano 64 Jun 23 12:09 0 -> /dev/null
lrwx------ 1 cardano cardano 64 Jun 23 12:09 1 -> 'socket:[66586]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 10 -> 'pipe:[58084]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 11 -> 'anon_inode:[eventfd]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 12 -> 'pipe:[58085]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 13 -> 'pipe:[58085]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 14 -> 'anon_inode:[eventfd]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 15 -> 'socket:[66586]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 16 -> 'socket:[61971]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 17 -> 'socket:[61972]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 18 -> /var/lib/cardano-node/db-mainnet/lock
l-wx------ 1 cardano cardano 64 Jun 23 12:09 19 -> /var/lib/cardano-node/db-mainnet/immutable/03345.chunk
lrwx------ 1 cardano cardano 64 Jun 23 12:09 2 -> 'socket:[66586]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 20 -> /var/lib/cardano-node/db-mainnet/immutable/03345.primary
l-wx------ 1 cardano cardano 64 Jun 23 12:09 21 -> /var/lib/cardano-node/db-mainnet/immutable/03345.secondary
l-wx------ 1 cardano cardano 64 Jun 23 12:09 22 -> /var/lib/cardano-node/db-mainnet/volatile/blocks-7792.dat
lrwx------ 1 cardano cardano 64 Jun 23 12:09 23 -> /var/lib/cardano-node/db-mainnet/ledger/72283294
lr-x------ 1 cardano cardano 64 Jun 23 12:14 24 -> /proc/5049/stat
lrwx------ 1 cardano cardano 64 Jun 23 12:14 25 -> 'socket:[99404]'
lrwx------ 1 cardano cardano 64 Jun 23 12:14 26 -> 'socket:[77687]'
lrwx------ 1 cardano cardano 64 Jun 23 12:14 27 -> 'socket:[75716]'
lrwx------ 1 cardano cardano 64 Jun 23 12:14 28 -> 'socket:[99405]'
lrwx------ 1 cardano cardano 64 Jun 23 12:14 29 -> 'socket:[103426]'
lrwx------ 1 cardano cardano 64 Jun 23 11:39 3 -> 'anon_inode:[eventpoll]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 4 -> 'anon_inode:[timerfd]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 5 -> 'pipe:[58081]'
l-wx------ 1 cardano cardano 64 Jun 23 12:09 6 -> 'pipe:[58081]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 7 -> 'anon_inode:[eventfd]'
lrwx------ 1 cardano cardano 64 Jun 23 12:09 8 -> 'anon_inode:[eventpoll]'
lr-x------ 1 cardano cardano 64 Jun 23 12:09 9 -> 'pipe:[58084]'
After the crash, during the core dump, the entry for fd 24 -> /proc/5049/stat disappears.
free
total used free shared buff/cache available
Mem: 61603324 15384844 505364 115644 45713116 45413900
Swap: 0 0 0
The process timeline: 11:39 the node starts; 12:14:24 the node crashes; 12:15:56 the core dump finishes, the systemd service ends, and a new one is restarted.
Are you sure this isn't oversubscribed or clustered storage with aggressive memory cache settings?
The period you mention is when IO activity (not necessarily throughput) will be high, and bad cache settings for clustered storage can exhaust your memory easily.
Indeed, that cache does not seem right. I was looking for a crazy number of open files, but it looks normal. I assume you have a script to start the node. What does it look like?
This is a desktop machine; I haven't changed the Arch Linux defaults regarding cache settings. I have no idea what clustered storage is; I have an NVMe disk. Yes, the memory cache usage is high, but in general it is fine for Linux to cache files read from disk in RAM. And after the cache is saturated, the node keeps validating chunks. The crash happens after all the validations have completed, while the node is growing the chain with new blocks.
I start the node directly with systemd. This is the service file
[Unit]
Description=Cardano node Mainnet
Requires=network.target
[Service]
Type=simple
Environment=LD_LIBRARY_PATH=/opt/local/lib:
Environment=PKG_CONFIG_PATH=/opt/local/lib/pkgconfig:
ExecStartPre=/usr/bin/env
ExecStart=cardano-node run \
--database-path "/var/lib/cardano-node/db-mainnet" \
--socket-path "/run/cardano-node-mainnet/socket" \
--port 3001 \
--config "/var/lib/cardano-node/config/mainnet/config.json" \
--topology "/var/lib/cardano-node/config/mainnet/topology.json"
KillSignal=SIGINT
Restart=always
RestartSec=60
TimeoutStopSec=600
LimitNOFILE=32768
# Directory creation and permissions
####################################
# Run as cardano:cardano
User=cardano
Group=cardano
# /run/cardano
RuntimeDirectory=cardano-node-mainnet
RuntimeDirectoryMode=0710
# /var/lib/cardano
StateDirectory=cardano-node
StateDirectoryMode=0710
[Install]
WantedBy=multi-user.target
I just updated to cardano-node 8.1.1 using the binary from the GitHub release. I still have the same problem: after incrementing the tip by 15 blocks, the entire node crashes.
How many nodes is the node trying to connect to?
@coot It uses the default topology file, thus it only tries to connect to the IOHK relay.
I don't know why it makes a difference, but I created a swapfile and enabled it. Now it works. The node does consume all RAM, but it is mostly cache: the node itself uses 13.4 GB in operation, while the systemd service including the cache shows 52 GB. The swap has only 256K in use, but the node hasn't crashed yet. Previously it crashed after about 30 minutes, after re-validating chunks and adding 16 blocks; now it has been running for 2 hours and looks stable. I have already synced to the Babbage era.
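For anyone landing here, a typical swapfile setup on Arch might look like the sketch below; the 8G size and /swapfile path are assumptions, not necessarily what was used above:

```shell
# Hypothetical swapfile setup (8G size and /swapfile path are assumptions).
# The creation steps require root, so they are shown as comments:
#   fallocate -l 8G /swapfile
#   chmod 600 /swapfile
#   mkswap /swapfile
#   swapon /swapfile
# To persist across reboots, add this line to /etc/fstab:
#   /swapfile none swap defaults 0 0
# Verify the kernel sees the swap:
grep -E '^Swap(Total|Free):' /proc/meminfo
```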
I am glad you found a fix which worked for you. How much resident RAM does it consume after it synced?
From gLiveView:
│ - NODE RESOURCE USAGE ----------------------------------------------- │
│ CPU (sys) : 1.21%    Mem (Live) : 5.8G     GC Minor : 51651
│ Mem (RSS) : 13.1G    Mem (Heap) : 13.0G    GC Major : 32
It all looks good now. I'll close the issue, please re-open or open a new one if you encounter something similar in the future.
Summary: Node crashes during regular operation and I can't sync to the network.
Steps to reproduce the behavior:
Expected behavior: The node keeps working without crashing.
System info (please complete the following information):
Screenshots and attachments: this is the systemd logfile
After this crash it takes a long while to validate the DB again; well, it says it is validating chunks. Then it replays blocks, goes back to syncing, and crashes again.
This is my monitoring log with netdata. You can see that the cardano-node-mainnet systemd service used 200% CPU, that is, two cores. I'm still syncing to the network, and the memory usage seems to grow linearly. You can see the first crash at 2:43, which clears all memory use. After the immediate restart it uses 100% CPU (one core), and memory grows rapidly up to a flat maximum, which is higher than the previous maximum reached before the crash. The instant when memory usage reached this maximum corresponds in the log to
Additional context: I run the node on a desktop computer with 64 GB of RAM. Given that your current recommendation for the node is 24 GB of RAM, I should have more than enough. Yet netdata monitoring shows that I used 42.2 GiB before the first crash and reached 53.5 GiB before the second crash, after which I stopped the service.