Closed giblfiz closed 1 year ago
Just a note: it seems to no longer crash since it has caught up to the head of the chain.
Can you provide the Geth version, the flags to setup Geth and your system environment information?
Geth v1.10.17 (this build: https://gethstore.blob.core.windows.net/builds/geth-linux-amd64-1.10.17-25c9b49f.tar.gz )
The only flag is "--http"
> uname -a
Linux ip-##censored##.us-west-2.compute.internal 5.10.96-90.460.amzn2.x86_64 #1 SMP Fri Feb 4 17:12:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
> sudo lshw -short
H/W path    Device  Class          Description
==============================================
                    system         t3.xlarge
/0                  bus            Motherboard
/0/0                memory         64KiB BIOS
/0/4                processor      Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
/0/4/5              memory         1536KiB L1 cache
/0/4/6              memory         24MiB L2 cache
/0/4/7              memory         33MiB L3 cache
/0/8                memory         16GiB System Memory
/0/8/0              memory         16GiB DIMM DDR4 Static column Pseudo-static Synchronous Window DRAM 2
/0/100              bridge         440FX - 82441FX PMC [Natoma]
/0/100/1            bridge         82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.3          generic        82371AB/EB/MB PIIX4 ACPI
/0/100/3            display        Amazon.com, Inc.
/0/100/4            storage        Amazon.com, Inc.
/0/100/5    eth0    network        Elastic Network Adapter (ENA)
/0/1                system         PnP device PNP0b00
/0/2                input          PnP device PNP0303
/0/3                input          PnP device PNP0f13
/0/5                printer        PnP device PNP0400
/0/6                communication  PnP device PNP0501
Could you do a ps | aux and provide the lines for Geth and Teku? Geth does eat up quite a bit of RAM from time to time, and it would be nice to know how much Teku is holding on to. Perhaps there's some scenario where the two together overload the machine beyond the available 16GB.
I'm assuming you meant ps -aux?
$ ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 191332 3284 ? Ss Feb16 2:10 /usr/lib/systemd/systemd --switched-roo
root 2 0.0 0.0 0 0 ? S Feb16 0:01 [kthreadd]
root 3 0.0 0.0 0 0 ? I< Feb16 0:00 [rcu_gp]
root 4 0.0 0.0 0 0 ? I< Feb16 0:00 [rcu_par_gp]
root 6 0.0 0.0 0 0 ? I< Feb16 0:00 [kworker/0:0H-ev]
root 9 0.0 0.0 0 0 ? I< Feb16 0:00 [mm_percpu_wq]
root 10 0.0 0.0 0 0 ? S Feb16 0:00 [rcu_tasks_rude_]
root 11 0.0 0.0 0 0 ? S Feb16 0:00 [rcu_tasks_trace]
root 12 0.0 0.0 0 0 ? S Feb16 5:26 [ksoftirqd/0]
root 13 0.0 0.0 0 0 ? I Feb16 22:54 [rcu_sched]
root 14 0.0 0.0 0 0 ? S Feb16 0:25 [migration/0]
root 15 0.0 0.0 0 0 ? S Feb16 0:00 [cpuhp/0]
root 16 0.0 0.0 0 0 ? S Feb16 0:00 [cpuhp/1]
root 17 0.0 0.0 0 0 ? S Feb16 0:26 [migration/1]
root 18 0.0 0.0 0 0 ? S Feb16 5:23 [ksoftirqd/1]
root 20 0.0 0.0 0 0 ? I< Feb16 0:00 [kworker/1:0H-ev]
root 21 0.0 0.0 0 0 ? S Feb16 0:00 [cpuhp/2]
root 22 0.0 0.0 0 0 ? S Feb16 0:14 [migration/2]
root 23 0.0 0.0 0 0 ? S Feb16 4:51 [ksoftirqd/2]
root 25 0.0 0.0 0 0 ? I< Feb16 0:00 [kworker/2:0H-ev]
root 26 0.0 0.0 0 0 ? S Feb16 0:00 [cpuhp/3]
root 27 0.0 0.0 0 0 ? S Feb16 0:14 [migration/3]
root 28 0.0 0.0 0 0 ? S Feb16 4:55 [ksoftirqd/3]
root 30 0.0 0.0 0 0 ? I< Feb16 0:00 [kworker/3:0H-ev]
root 35 0.0 0.0 0 0 ? S Feb16 0:00 [kdevtmpfs]
root 36 0.0 0.0 0 0 ? I< Feb16 0:00 [netns]
root 39 0.0 0.0 0 0 ? S Feb16 0:01 [kauditd]
root 284 0.0 0.0 0 0 ? S Feb16 0:02 [khungtaskd]
root 285 0.0 0.0 0 0 ? S Feb16 0:04 [oom_reaper]
root 286 0.0 0.0 0 0 ? I< Feb16 0:00 [writeback]
root 288 0.0 0.0 0 0 ? S Feb16 39:39 [kcompactd0]
root 289 0.0 0.0 0 0 ? SN Feb16 0:00 [ksmd]
root 290 0.0 0.0 0 0 ? SN Feb16 3:11 [khugepaged]
root 345 0.0 0.0 0 0 ? I< Feb16 0:00 [kintegrityd]
root 346 0.0 0.0 0 0 ? I< Feb16 0:00 [kblockd]
root 348 0.0 0.0 0 0 ? I< Feb16 0:00 [blkcg_punt_bio]
root 457 0.0 0.0 0 0 ? I< Feb16 0:00 [tpm_dev_wq]
root 464 0.0 0.0 0 0 ? I< Feb16 0:00 [md]
root 471 0.0 0.0 0 0 ? I< Feb16 0:00 [edac-poller]
root 476 0.0 0.0 0 0 ? S Feb16 0:00 [watchdogd]
root 567 0.0 0.0 0 0 ? I< Feb16 1:08 [kworker/2:1H-xf]
root 613 0.0 0.0 0 0 ? S Feb16 64:43 [kswapd0]
root 615 0.0 0.0 0 0 ? I< Feb16 0:00 [xfsalloc]
root 616 0.0 0.0 0 0 ? I< Feb16 0:00 [xfs_mru_cache]
root 619 0.0 0.0 0 0 ? I< Feb16 0:00 [kthrotld]
root 665 0.0 0.0 0 0 ? I< Feb16 0:00 [nvme-wq]
root 667 0.0 0.0 0 0 ? I< Feb16 0:00 [nvme-reset-wq]
root 668 0.0 0.0 0 0 ? I< Feb16 0:00 [nvme-delete-wq]
root 702 0.0 0.0 0 0 ? I< Feb16 0:00 [ipv6_addrconf]
root 703 0.0 0.0 0 0 ? I< Feb16 0:19 [kworker/1:1H-kb]
root 712 0.0 0.0 0 0 ? I< Feb16 0:00 [kstrp]
root 725 0.0 0.0 0 0 ? I< Feb16 0:00 [zswap-shrink]
root 726 0.0 0.0 0 0 ? I< Feb16 0:00 [kworker/u9:0]
root 774 0.0 0.0 0 0 ? I 17:42 0:00 [kworker/0:2-eve]
postfix 943 0.0 0.0 90432 3532 ? S 16:41 0:00 pickup -l -t unix -u
root 1283 0.0 0.0 0 0 ? I< Feb16 1:03 [kworker/3:1H-xf]
root 1293 0.0 0.0 0 0 ? I< Feb16 0:00 [xfs-buf/nvme0n1]
root 1294 0.0 0.0 0 0 ? I< Feb16 0:00 [xfs-conv/nvme0n]
root 1295 0.0 0.0 0 0 ? I< Feb16 0:00 [xfs-cil/nvme0n1]
root 1296 0.0 0.0 0 0 ? I< Feb16 0:00 [xfs-reclaim/nvm]
root 1297 0.0 0.0 0 0 ? I< Feb16 0:00 [xfs-eofblocks/n]
root 1298 0.0 0.0 0 0 ? I< Feb16 0:00 [xfs-log/nvme0n1]
root 1299 0.0 0.0 0 0 ? S Feb16 10:54 [xfsaild/nvme0n1]
root 1300 0.0 0.0 0 0 ? I< Feb16 0:20 [kworker/0:1H-kb]
root 1363 0.0 0.0 186844 13040 ? Ss Feb16 2:01 /usr/lib/systemd/systemd-journald
root 1388 0.0 0.0 118804 276 ? Ss Feb16 0:00 /usr/sbin/lvmetad -f
root 1395 0.0 0.0 0 0 ? I< Feb16 0:00 [ena]
root 1412 0.0 0.0 46176 848 ? Ss Feb16 0:00 /usr/lib/systemd/systemd-udevd
root 1948 0.0 0.0 0 0 ? I< Feb16 0:00 [cryptd]
root 2051 0.0 0.0 0 0 ? I< Feb16 0:00 [rpciod]
root 2052 0.0 0.0 0 0 ? I< Feb16 0:00 [xprtiod]
root 2056 0.0 0.0 59740 468 ? S<sl Feb16 0:05 /sbin/auditd
dbus 2083 0.0 0.0 60480 716 ? Ss Feb16 2:04 /usr/bin/dbus-daemon --system --address
rpc 2084 0.0 0.0 69352 552 ? Ss Feb16 0:04 /sbin/rpcbind -w
root 2085 0.0 0.0 101912 304 ? Ssl Feb16 1:32 /usr/sbin/irqbalance --foreground
libstor+ 2086 0.0 0.0 12624 172 ? Ss Feb16 0:06 /usr/bin/lsmd -d
root 2090 0.0 0.0 28752 1044 ? Ss Feb16 0:38 /usr/lib/systemd/systemd-logind
rngd 2103 0.0 0.0 94100 848 ? Ss Feb16 0:00 /sbin/rngd -f --fill-watermark=0 --excl
root 2126 0.0 0.0 101596 472 ? Ssl Feb16 0:00 /usr/sbin/gssproxy -D
root 2331 0.0 0.0 100724 3128 ? Ss Feb16 0:03 /sbin/dhclient -q -lf /var/lib/dhclient
root 2376 0.0 0.0 100724 2044 ? Ss Feb16 0:06 /sbin/dhclient -6 -nw -lf /var/lib/dhcl
root 2529 0.0 0.0 90348 1284 ? Ss Feb16 0:09 /usr/libexec/postfix/master -w
postfix 2531 0.0 0.0 90512 1036 ? S Feb16 0:02 qmgr -l -t unix -u
root 2643 0.0 0.0 27888 208 ? Ss Feb16 0:00 /usr/sbin/atd -f
root 2659 0.0 0.0 121304 124 tty1 Ss+ Feb16 0:00 /sbin/agetty --noclear tty1 linux
root 2660 0.0 0.0 10552 128 ttyS0 Ss+ Feb16 0:00 /sbin/agetty --keep-baud 115200,38400,9
root 2662 0.0 0.0 0 0 ? I 15:44 0:00 [kworker/u8:1-ev]
root 2716 0.0 0.0 152696 8668 ? Ss 17:45 0:00 sshd: ec2-user [priv]
root 2844 0.0 0.0 4264 104 ? Ss Feb16 0:00 /usr/sbin/acpid
ec2-user 2860 0.0 0.0 152696 4432 ? R 17:45 0:00 sshd: ec2-user@pts/0
ec2-user 2861 0.1 0.0 124860 4076 pts/0 Ss 17:45 0:00 -bash
root 3058 0.0 0.0 4240 736 ? S 17:46 0:00 sleep 1
ec2-user 3060 0.0 0.0 164364 3768 pts/0 R+ 17:46 0:00 ps -aux
root 3063 0.0 0.0 112916 1776 ? Ss Feb16 0:00 /usr/sbin/sshd -D
root 3698 0.0 0.0 0 0 ? I 16:46 0:00 [kworker/0:0-eve]
ec2-user 7251 0.0 0.0 135068 1336 ? Ss Feb16 0:37 SCREEN -S geth
ec2-user 7252 0.0 0.0 125012 2240 pts/1 Ss+ Feb16 0:00 /bin/bash
ec2-user 8848 0.0 0.0 134772 1220 ? Ss Feb16 0:25 SCREEN -S teku
ec2-user 8849 0.0 0.0 125012 1220 pts/3 Ss Feb16 0:00 /bin/bash
ec2-user 9090 172 31.5 8645616 5103308 pts/3 Sl+ Apr08 14319:54 java -Dvertx.disableFileCPResolving=t
ec2-user 9765 51.1 52.1 17116708 8438168 ? Ssl Apr09 3514:06 /home/ec2-user/geth-precompile --http
root 10239 0.0 0.0 0 0 ? I 16:58 0:00 [kworker/3:1-eve]
root 15471 0.0 0.0 0 0 ? I 17:08 0:00 [kworker/u8:2-ev]
chrony 18968 0.0 0.0 105108 908 ? S Apr08 0:06 /usr/sbin/chronyd
root 20739 0.0 0.0 24688 1832 ? Ss Apr08 0:00 /usr/sbin/crond -n
root 21856 0.0 0.0 718888 8612 ? Ssl Apr08 0:27 /usr/bin/amazon-ssm-agent
root 21899 0.0 0.0 460800 1756 ? Ssl Apr08 0:33 /usr/sbin/rsyslogd -n
root 22053 0.0 0.0 731252 14780 ? Sl Apr08 0:20 /usr/bin/ssm-agent-worker
root 25390 0.0 0.0 0 0 ? I 17:27 0:00 [kworker/3:2-eve]
root 25925 0.0 0.0 0 0 ? I 17:28 0:00 [kworker/2:2-xfs]
root 26233 0.0 0.0 13776 2688 ? Ss 17:29 0:00 /bin/bash /usr/bin/log4j-cve-2021-44228
root 26396 0.0 0.0 0 0 ? I 17:29 0:00 [kworker/0:1-mm_]
root 32028 0.0 0.0 0 0 ? I 17:40 0:00 [kworker/1:1]
root 32030 0.0 0.0 0 0 ? I 17:40 0:00 [kworker/1:3-mm_]
root 32214 0.0 0.0 0 0 ? I 17:40 0:00 [kworker/2:1-eve]
Also worth mentioning again, once syncing completed it stopped having memory failures. So it has been up and stable in the current configuration for ~4 days now. Sorry I didn't get a snapshot of that while it was having the issue.
If you really want to chase this hard, I can probably clone the instance and see if I can replicate it on the testnet.
ec2-user 9090 172 31.5 8645616 5103308 pts/3 Sl+ Apr08 14319:54 java -Dvertx.disableFileCPResolving=t
This process also uses 31.5% of memory. If Geth were the only thing running on a 16GB machine, that would be enough memory and it shouldn't crash. Our hunch is that something else on the same machine (Teku, say; we don't know how much memory it uses) is leaving Geth with less than 16GB. On mainnet, Geth normally uses around 10GB of memory with the default configuration.
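To put a number on the combined footprint, the RSS column of ps aux output (field 6, in KiB) can be summed for the Geth and Teku rows. The following is only a sketch, not something posted in the thread: sum_rss_gib is a hypothetical helper name, and the sample rows are copied from the ps listing above.

```shell
#!/usr/bin/env bash
# sum_rss_gib (hypothetical helper): read `ps aux`-style rows on stdin, add up
# the RSS column (field 6, KiB) for rows whose command (field 11) matches the
# given regular expression, and print the total in GiB.
sum_rss_gib() {
  awk -v pat="$1" '
    NR > 1 && $11 ~ pat { total += $6 }          # NR > 1 skips the header row
    END { printf "%.2f\n", total / 1024 / 1024 } # KiB -> GiB
  '
}

# Live usage would look like:  ps aux | sum_rss_gib "geth|java"

# The header plus the two relevant rows from the listing in this issue:
sample='USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ec2-user 9090 172 31.5 8645616 5103308 pts/3 Sl+ Apr08 14319:54 java -Dvertx.disableFileCPResolving=t
ec2-user 9765 51.1 52.1 17116708 8438168 ? Ssl Apr09 3514:06 /home/ec2-user/geth-precompile --http'

printf '%s\n' "$sample" | sum_rss_gib 'geth|java'   # prints 12.91
```

In this post-sync snapshot the two processes together hold about 12.9 GiB resident, which leaves little headroom on a 16 GiB instance.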
Please try setting the database cache amount. The value is in megabytes, so for example this sets it to about 2.5GB:
geth --cache 2500
The default cache on mainnet is 4GB, and that might be too much. Geth generally uses more memory than the configured cache amount; the setting is just a hint.
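Because --cache is given in megabytes, a budget expressed in GiB has to be converted before it is passed to Geth. A small sketch of that conversion, with the 2 GiB figure chosen purely for illustration; it prints the resulting command rather than starting Geth:

```shell
#!/usr/bin/env bash
# --cache takes megabytes, so convert a GiB budget before building the command.
cache_gib=2                       # illustrative budget, not a recommendation
cache_mb=$((cache_gib * 1024))    # 2 GiB -> 2048 MB
geth_cmd="geth --http --cache ${cache_mb}"
echo "$geth_cmd"                  # prints: geth --http --cache 2048
```

As noted above, the cache size is only a hint, so actual resident memory should still be checked after the change.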
The issue seems resolved: memory usage went down after sync. I recall that around this time Geth may have used more memory than needed during sync, which has since been fixed. Will close; feel free to open another issue if you still see OOMs during sync on 16GB.
After upgrading from geth 1.10.16 I have started getting frequent spontaneous crashes due to OOM.
from STDOUT it looks like this:
When I run dmesg I see this:
(Note: I downloaded this version from the website and renamed it "geth-precompile". I usually build from source, but when this issue showed up I figured I would also try the precompiled distribution.)
This has happened repeatedly during the sync process, about every 8 hours.
The system it's running on is an AWS t3.xlarge, which has 16 gigs of memory. The only other process of note running on the system is a teku beacon node.
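For anyone trying to confirm the same symptom, one way to check whether the kernel OOM killer was involved is to filter the kernel log for its messages. This is a hedged sketch: oom_events is a hypothetical helper name, and the pattern covers the usual Linux phrases ("Out of memory", "oom-kill", "oom_reaper") rather than the exact lines from this report.

```shell
#!/usr/bin/env bash
# oom_events (hypothetical helper): keep only kernel OOM-killer lines from a
# dmesg-style stream on stdin. `|| true` keeps the exit status zero when
# nothing matches, so the helper is safe inside scripts using `set -e`.
oom_events() {
  grep -i -E 'out of memory|oom-kill|oom_reaper' || true
}

# Live usage (may require root on some systems):
#   dmesg -T | oom_events
```

An empty result suggests the process was not killed by the kernel, in which case the crash output from Geth itself is the place to look.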
Thanks for all the great work, and let me know if I can give you more information that is helpful.