ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

Out of memory crash on 16 gb system #24673

Closed: giblfiz closed this issue 1 year ago

giblfiz commented 2 years ago

After upgrading from geth 1.10.16 I have started getting frequent spontaneous crashes due to OOM.

From stdout it looks like this:

INFO [04-09|10:13:49.315] Imported new chain segment               blocks=5    txs=1011    mgas=64.453  elapsed=9.863s    mgasps=6.534  number=14,504,061 hash=351326..825318 age=1w8h15s  dirty=1018.69MiB
Killed

When I run dmesg I see this:

[4529005.076833] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/,task=geth-precompile,pid=23252,uid=1000
[4529005.087037] Out of memory: Killed process 23252 (geth-precompile) total-vm:16806032kB, anon-rss:10360840kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:26056kB oom_score_adj:0
[4529007.078967] oom_reaper: reaped process 23252 (geth-precompile), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

(Note: I renamed this version "geth-precompile"; I downloaded it from the website. I usually build from source, but when this issue showed up I figured I would also try the precompiled distribution.)

This has happened repeatedly during the sync process, about every 8 hours.

The system it's running on is an AWS t3.xlarge, which has 16 gigs of memory. The only other process of note running on the system is a teku beacon node.

Thanks for all the great work, and let me know if I can give you more information that is helpful.
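One quick way to confirm that the kernel OOM killer (rather than a fault in geth itself) terminated the process is to grep the kernel log for OOM records. A minimal sketch, shown here against the dmesg excerpt above; on a live host you would pipe `dmesg -T` into the grep instead of the here-document:

```shell
# Scan kernel-log lines for OOM-killer records. On the live host, run:
#   dmesg -T | grep -Ei 'oom|out of memory'
# The sample below reuses the two dmesg lines from this report.
grep -Ei 'oom|out of memory' <<'EOF'
[4529005.087037] Out of memory: Killed process 23252 (geth-precompile) total-vm:16806032kB, anon-rss:10360840kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:26056kB oom_score_adj:0
[4529007.078967] oom_reaper: reaped process 23252 (geth-precompile), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
EOF
```

The `anon-rss` field in the first record shows the resident memory (~10GB here) the process held when it was killed.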

giblfiz commented 2 years ago

Just a note: it seems to no longer crash since it has caught up to the head of the chain.

rjl493456442 commented 2 years ago

Can you provide the Geth version, the flags used to set up Geth, and your system environment information?

giblfiz commented 2 years ago

Geth v1.10.17 (this build: https://gethstore.blob.core.windows.net/builds/geth-linux-amd64-1.10.17-25c9b49f.tar.gz )

The only flag is "--http"


> uname -a
Linux ip-##censored##.us-west-2.compute.internal 5.10.96-90.460.amzn2.x86_64 #1 SMP Fri Feb 4 17:12:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

> sudo lshw -short
H/W path    Device  Class          Description
==============================================
                    system         t3.xlarge
/0                  bus            Motherboard
/0/0                memory         64KiB BIOS
/0/4                processor      Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
/0/4/5              memory         1536KiB L1 cache
/0/4/6              memory         24MiB L2 cache
/0/4/7              memory         33MiB L3 cache
/0/8                memory         16GiB System Memory
/0/8/0              memory         16GiB DIMM DDR4 Static column Pseudo-static Synchronous Window DRAM 2
/0/100              bridge         440FX - 82441FX PMC [Natoma]
/0/100/1            bridge         82371SB PIIX3 ISA [Natoma/Triton II]
/0/100/1.3          generic        82371AB/EB/MB PIIX4 ACPI
/0/100/3            display        Amazon.com, Inc.
/0/100/4            storage        Amazon.com, Inc.
/0/100/5    eth0    network        Elastic Network Adapter (ENA)
/0/1                system         PnP device PNP0b00
/0/2                input          PnP device PNP0303
/0/3                input          PnP device PNP0f13
/0/5                printer        PnP device PNP0400
/0/6                communication  PnP device PNP0501
karalabe commented 2 years ago

Could you do a ps | aux and provide the lines for Geth and Teku? Geth does eat up quite a bit of RAM from time to time, and it would be nice to know how much Teku is holding on to. Perhaps there's some scenario where the two together just overload the machine beyond the available 16GB.
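A hedged sketch of how to pull out just the relevant rows while keeping the ps header, so the %MEM and RSS columns stay labeled. On the box itself you would pipe `ps aux` into the filter instead of the sample here-document:

```shell
# Keep the ps header row plus any geth/java (Teku) rows; the sample input
# mimics the ps -aux output format from this thread.
awk 'NR == 1 || /geth|java/' <<'EOF'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 191332  3284 ?        Ss   Feb16   2:10 /usr/lib/sd/sd --switched-roo
ec2-user  9090  172 31.5 8645616 5103308 pts/3 Sl+  Apr08 14319:54 java -Dvertx.disableFileCPResolving=t
ec2-user  9765 51.1 52.1 17116708 8438168 ?    Ssl  Apr09 3514:06 /home/ec2-user/geth-precompile --http
EOF
```

Only the header and the two matching rows survive the filter, which makes the memory comparison easy to read.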

giblfiz commented 2 years ago

I'm assuming you meant ps -aux ?

$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 191332  3284 ?        Ss   Feb16   2:10 /usr/lib/systemd/systemd --switched-roo
root         2  0.0  0.0      0     0 ?        S    Feb16   0:01 [kthreadd]
root         3  0.0  0.0      0     0 ?        I<   Feb16   0:00 [rcu_gp]
root         4  0.0  0.0      0     0 ?        I<   Feb16   0:00 [rcu_par_gp]
root         6  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kworker/0:0H-ev]
root         9  0.0  0.0      0     0 ?        I<   Feb16   0:00 [mm_percpu_wq]
root        10  0.0  0.0      0     0 ?        S    Feb16   0:00 [rcu_tasks_rude_]
root        11  0.0  0.0      0     0 ?        S    Feb16   0:00 [rcu_tasks_trace]
root        12  0.0  0.0      0     0 ?        S    Feb16   5:26 [ksoftirqd/0]
root        13  0.0  0.0      0     0 ?        I    Feb16  22:54 [rcu_sched]
root        14  0.0  0.0      0     0 ?        S    Feb16   0:25 [migration/0]
root        15  0.0  0.0      0     0 ?        S    Feb16   0:00 [cpuhp/0]
root        16  0.0  0.0      0     0 ?        S    Feb16   0:00 [cpuhp/1]
root        17  0.0  0.0      0     0 ?        S    Feb16   0:26 [migration/1]
root        18  0.0  0.0      0     0 ?        S    Feb16   5:23 [ksoftirqd/1]
root        20  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kworker/1:0H-ev]
root        21  0.0  0.0      0     0 ?        S    Feb16   0:00 [cpuhp/2]
root        22  0.0  0.0      0     0 ?        S    Feb16   0:14 [migration/2]
root        23  0.0  0.0      0     0 ?        S    Feb16   4:51 [ksoftirqd/2]
root        25  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kworker/2:0H-ev]
root        26  0.0  0.0      0     0 ?        S    Feb16   0:00 [cpuhp/3]
root        27  0.0  0.0      0     0 ?        S    Feb16   0:14 [migration/3]
root        28  0.0  0.0      0     0 ?        S    Feb16   4:55 [ksoftirqd/3]
root        30  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kworker/3:0H-ev]
root        35  0.0  0.0      0     0 ?        S    Feb16   0:00 [kdevtmpfs]
root        36  0.0  0.0      0     0 ?        I<   Feb16   0:00 [netns]
root        39  0.0  0.0      0     0 ?        S    Feb16   0:01 [kauditd]
root       284  0.0  0.0      0     0 ?        S    Feb16   0:02 [khungtaskd]
root       285  0.0  0.0      0     0 ?        S    Feb16   0:04 [oom_reaper]
root       286  0.0  0.0      0     0 ?        I<   Feb16   0:00 [writeback]
root       288  0.0  0.0      0     0 ?        S    Feb16  39:39 [kcompactd0]
root       289  0.0  0.0      0     0 ?        SN   Feb16   0:00 [ksmd]
root       290  0.0  0.0      0     0 ?        SN   Feb16   3:11 [khugepaged]
root       345  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kintegrityd]
root       346  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kblockd]
root       348  0.0  0.0      0     0 ?        I<   Feb16   0:00 [blkcg_punt_bio]
root       457  0.0  0.0      0     0 ?        I<   Feb16   0:00 [tpm_dev_wq]
root       464  0.0  0.0      0     0 ?        I<   Feb16   0:00 [md]
root       471  0.0  0.0      0     0 ?        I<   Feb16   0:00 [edac-poller]
root       476  0.0  0.0      0     0 ?        S    Feb16   0:00 [watchdogd]
root       567  0.0  0.0      0     0 ?        I<   Feb16   1:08 [kworker/2:1H-xf]
root       613  0.0  0.0      0     0 ?        S    Feb16  64:43 [kswapd0]
root       615  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xfsalloc]
root       616  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xfs_mru_cache]
root       619  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kthrotld]
root       665  0.0  0.0      0     0 ?        I<   Feb16   0:00 [nvme-wq]
root       667  0.0  0.0      0     0 ?        I<   Feb16   0:00 [nvme-reset-wq]
root       668  0.0  0.0      0     0 ?        I<   Feb16   0:00 [nvme-delete-wq]
root       702  0.0  0.0      0     0 ?        I<   Feb16   0:00 [ipv6_addrconf]
root       703  0.0  0.0      0     0 ?        I<   Feb16   0:19 [kworker/1:1H-kb]
root       712  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kstrp]
root       725  0.0  0.0      0     0 ?        I<   Feb16   0:00 [zswap-shrink]
root       726  0.0  0.0      0     0 ?        I<   Feb16   0:00 [kworker/u9:0]
root       774  0.0  0.0      0     0 ?        I    17:42   0:00 [kworker/0:2-eve]
postfix    943  0.0  0.0  90432  3532 ?        S    16:41   0:00 pickup -l -t unix -u
root      1283  0.0  0.0      0     0 ?        I<   Feb16   1:03 [kworker/3:1H-xf]
root      1293  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xfs-buf/nvme0n1]
root      1294  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xfs-conv/nvme0n]
root      1295  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xfs-cil/nvme0n1]
root      1296  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xfs-reclaim/nvm]
root      1297  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xfs-eofblocks/n]
root      1298  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xfs-log/nvme0n1]
root      1299  0.0  0.0      0     0 ?        S    Feb16  10:54 [xfsaild/nvme0n1]
root      1300  0.0  0.0      0     0 ?        I<   Feb16   0:20 [kworker/0:1H-kb]
root      1363  0.0  0.0 186844 13040 ?        Ss   Feb16   2:01 /usr/lib/systemd/systemd-journald
root      1388  0.0  0.0 118804   276 ?        Ss   Feb16   0:00 /usr/sbin/lvmetad -f
root      1395  0.0  0.0      0     0 ?        I<   Feb16   0:00 [ena]
root      1412  0.0  0.0  46176   848 ?        Ss   Feb16   0:00 /usr/lib/systemd/systemd-udevd
root      1948  0.0  0.0      0     0 ?        I<   Feb16   0:00 [cryptd]
root      2051  0.0  0.0      0     0 ?        I<   Feb16   0:00 [rpciod]
root      2052  0.0  0.0      0     0 ?        I<   Feb16   0:00 [xprtiod]
root      2056  0.0  0.0  59740   468 ?        S<sl Feb16   0:05 /sbin/auditd
dbus      2083  0.0  0.0  60480   716 ?        Ss   Feb16   2:04 /usr/bin/dbus-daemon --system --address
rpc       2084  0.0  0.0  69352   552 ?        Ss   Feb16   0:04 /sbin/rpcbind -w
root      2085  0.0  0.0 101912   304 ?        Ssl  Feb16   1:32 /usr/sbin/irqbalance --foreground
libstor+  2086  0.0  0.0  12624   172 ?        Ss   Feb16   0:06 /usr/bin/lsmd -d
root      2090  0.0  0.0  28752  1044 ?        Ss   Feb16   0:38 /usr/lib/systemd/systemd-logind
rngd      2103  0.0  0.0  94100   848 ?        Ss   Feb16   0:00 /sbin/rngd -f --fill-watermark=0 --excl
root      2126  0.0  0.0 101596   472 ?        Ssl  Feb16   0:00 /usr/sbin/gssproxy -D
root      2331  0.0  0.0 100724  3128 ?        Ss   Feb16   0:03 /sbin/dhclient -q -lf /var/lib/dhclient
root      2376  0.0  0.0 100724  2044 ?        Ss   Feb16   0:06 /sbin/dhclient -6 -nw -lf /var/lib/dhcl
root      2529  0.0  0.0  90348  1284 ?        Ss   Feb16   0:09 /usr/libexec/postfix/master -w
postfix   2531  0.0  0.0  90512  1036 ?        S    Feb16   0:02 qmgr -l -t unix -u
root      2643  0.0  0.0  27888   208 ?        Ss   Feb16   0:00 /usr/sbin/atd -f
root      2659  0.0  0.0 121304   124 tty1     Ss+  Feb16   0:00 /sbin/agetty --noclear tty1 linux
root      2660  0.0  0.0  10552   128 ttyS0    Ss+  Feb16   0:00 /sbin/agetty --keep-baud 115200,38400,9
root      2662  0.0  0.0      0     0 ?        I    15:44   0:00 [kworker/u8:1-ev]
root      2716  0.0  0.0 152696  8668 ?        Ss   17:45   0:00 sshd: ec2-user [priv]
root      2844  0.0  0.0   4264   104 ?        Ss   Feb16   0:00 /usr/sbin/acpid
ec2-user  2860  0.0  0.0 152696  4432 ?        R    17:45   0:00 sshd: ec2-user@pts/0
ec2-user  2861  0.1  0.0 124860  4076 pts/0    Ss   17:45   0:00 -bash
root      3058  0.0  0.0   4240   736 ?        S    17:46   0:00 sleep 1
ec2-user  3060  0.0  0.0 164364  3768 pts/0    R+   17:46   0:00 ps -aux
root      3063  0.0  0.0 112916  1776 ?        Ss   Feb16   0:00 /usr/sbin/sshd -D
root      3698  0.0  0.0      0     0 ?        I    16:46   0:00 [kworker/0:0-eve]
ec2-user  7251  0.0  0.0 135068  1336 ?        Ss   Feb16   0:37 SCREEN -S geth
ec2-user  7252  0.0  0.0 125012  2240 pts/1    Ss+  Feb16   0:00 /bin/bash
ec2-user  8848  0.0  0.0 134772  1220 ?        Ss   Feb16   0:25 SCREEN -S teku
ec2-user  8849  0.0  0.0 125012  1220 pts/3    Ss   Feb16   0:00 /bin/bash
ec2-user  9090  172 31.5 8645616 5103308 pts/3 Sl+  Apr08 14319:54 java -Dvertx.disableFileCPResolving=t
ec2-user  9765 51.1 52.1 17116708 8438168 ?    Ssl  Apr09 3514:06 /home/ec2-user/geth-precompile --http
root     10239  0.0  0.0      0     0 ?        I    16:58   0:00 [kworker/3:1-eve]
root     15471  0.0  0.0      0     0 ?        I    17:08   0:00 [kworker/u8:2-ev]
chrony   18968  0.0  0.0 105108   908 ?        S    Apr08   0:06 /usr/sbin/chronyd
root     20739  0.0  0.0  24688  1832 ?        Ss   Apr08   0:00 /usr/sbin/crond -n
root     21856  0.0  0.0 718888  8612 ?        Ssl  Apr08   0:27 /usr/bin/amazon-ssm-agent
root     21899  0.0  0.0 460800  1756 ?        Ssl  Apr08   0:33 /usr/sbin/rsyslogd -n
root     22053  0.0  0.0 731252 14780 ?        Sl   Apr08   0:20 /usr/bin/ssm-agent-worker
root     25390  0.0  0.0      0     0 ?        I    17:27   0:00 [kworker/3:2-eve]
root     25925  0.0  0.0      0     0 ?        I    17:28   0:00 [kworker/2:2-xfs]
root     26233  0.0  0.0  13776  2688 ?        Ss   17:29   0:00 /bin/bash /usr/bin/log4j-cve-2021-44228
root     26396  0.0  0.0      0     0 ?        I    17:29   0:00 [kworker/0:1-mm_]
root     32028  0.0  0.0      0     0 ?        I    17:40   0:00 [kworker/1:1]
root     32030  0.0  0.0      0     0 ?        I    17:40   0:00 [kworker/1:3-mm_]
root     32214  0.0  0.0      0     0 ?        I    17:40   0:00 [kworker/2:1-eve]

Also worth mentioning again: once syncing completed, the memory failures stopped. It has been up and stable in the current configuration for ~4 days now. Sorry I didn't get a snapshot while it was having the issue.

If you really want to chase this hard, I can probably clone the instance and see if I can replicate it on the testnet.

rjl493456442 commented 2 years ago

ec2-user 9090 172 31.5 8645616 5103308 pts/3 Sl+ Apr08 14319:54 java -Dvertx.disableFileCPResolving=t

This process also uses 31.5% of the memory. If you only run Geth on a 16GB machine, that's enough and it shouldn't OOM. Our hunch is that you are running something else on the same machine (e.g. Teku; we don't know how much memory it uses), so the memory actually available to Geth is less than 16GB. Normally, on mainnet, Geth uses around 10GB of memory with default configs.
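Putting numbers on this from the ps snapshot earlier in the thread: summing the RSS column (in KiB) for the java (Teku) and geth rows shows the two processes alone were holding close to 13GiB resident, leaving little of the 16GB for the kernel, page cache, and everything else. A quick sketch of the arithmetic:

```shell
# RSS values in KiB, taken from the `ps -aux` listing above:
# 5103308 (java/Teku) and 8438168 (geth-precompile). ps reports RSS in KiB,
# so dividing by 1024*1024 converts to GiB.
printf '%s\n' 5103308 8438168 \
  | awk '{ total += $1 } END { printf "%.1f GiB\n", total / 1024 / 1024 }'
# prints: 12.9 GiB
```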

fjl commented 2 years ago

Please try setting the database cache amount. For example, this sets it to about 2.5GB:

geth --cache 2500

The default cache on mainnet is 4GB, and that might be too much. Geth generally uses more memory than the configured cache amount; the setting is just a hint.
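A sketch of how the reporter's invocation would change with a reduced cache (the flag value is in MiB, so 2048 is roughly 2GiB; the exact value is a tuning choice, not a prescription):

```shell
# Hypothetical restart of the node with a smaller database cache.
# --cache takes MiB; the mainnet default is 4096 (4GB).
geth --http --cache 2048
```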

MariusVanDerWijden commented 1 year ago

Issue seems resolved; memory usage went down after sync. I remember that around this time we may have used more memory than needed during sync, which has since been fixed. Will close; feel free to open another issue if you still see OOMs during sync on 16GB.