ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

Many of my nodes OOMed at around the same time #30196

Closed: YuXiaoCoder closed this issue 3 months ago

YuXiaoCoder commented 3 months ago

System information

Geth version: 1.14.5-stable-0dd173a7
CL client & version: prysm@v5.0.4
OS & Version: Linux (Docker)

Expected behaviour

The nodes stay synchronised without OOMing, so that my validators can attest and propose blocks properly.

Actual behaviour

The geth process is killed and the node stops synchronising, causing it to fall behind the chain head.

Steps to reproduce the behaviour

I've been running these nodes normally for over two years; this is the first time I've encountered this issue.

Start command:

/opt/ethmain/core/geth --config=/mnt/ethmain/conf/config.toml --rpc.gascap=0 --rpc.txfeecap=0

Relevant config.toml:
[Eth]
NetworkId = 1
SyncMode = "snap"
EthDiscoveryURLs = ["enrtree://AKA3AM6LPBYEUDMVNU3BSVQJ5AD45Y7YPOHJLEF6W26QOE4VTUDPE@all.mainnet.ethdisco.net"]
SnapDiscoveryURLs = ["enrtree://AKA3AM6LPBYEUDMVNU3BSVQJ5AD45Y7YPOHJLEF6W26QOE4VTUDPE@all.mainnet.ethdisco.net"]
NoPruning = true  # --gcmode=archive
NoPrefetch = false
TxLookupLimit = 2350000
TransactionHistory = 2350000
StateHistory = 90000
StateScheme = "path"
LightPeers = 100
DatabaseCache = 512
DatabaseFreezer = ""
TrieCleanCache = 154
TrieDirtyCache = 256
TrieTimeout = 3600000000000
SnapshotCache = 102
Preimages = true
FilterLogCacheSize = 32
EnablePreimageRecording = false
RPCEVMTimeout = 5000000000

[Eth.TxPool]
Locals = []
NoLocals = false
Journal = "transactions.rlp"
Rejournal = 3600000000000
PriceLimit = 1
PriceBump = 10
AccountSlots = 16
GlobalSlots = 5120
AccountQueue = 64
GlobalQueue = 1024
Lifetime = 10800000000000

[Eth.BlobPool]
Datadir = "blobpool"
Datacap = 10737418240
PriceBump = 100

[Eth.GPO]
Blocks = 20
Percentile = 60
MaxHeaderHistory = 1024
MaxBlockHistory = 1024
MaxPrice = 500000000000
IgnorePrice = 2

[Node]
DataDir = "/mnt/ethmain/node/geth"
IPCPath = "/mnt/ethmain/node/geth.ipc"
HTTPHost = "0.0.0.0"
HTTPPort = 8545
HTTPVirtualHosts = ["*"]
HTTPModules = ["admin", "eth", "net", "engine", "web3", "personal", "debug", "txpool"]
AuthAddr = "0.0.0.0"
AuthPort = 8551
AuthVirtualHosts = ["*"]
WSHost = "0.0.0.0"
WSPort = 8546
WSOrigins = ["*"]
WSModules = ["admin", "eth", "net", "web3", "personal", "debug", "txpool"]
JWTSecret = "/mnt/ethmain/node/jwt.hex"
DBEngine = "pebble"

[Node.P2P]
MaxPeers = 50
NoDiscovery = false
DiscoveryV4 = true
BootstrapNodes = ["enode://d860a01f9722d78051619d1e2351aba3f43f943f6f00718d1b9baa4101932a1f5011f16bb2b1bb35db20d6fe28fa0bf09636d26a87d31de9ec6203eeedb1f666@18.138.108.67:30303", "enode://22a8232c3abc76a16ae9d6c3b164f98775fe226f0917b0ca871128a74a8e9630b458460865bab457221f1d448dd9791d24c4e5d88786180ac185df813a68d4de@3.209.45.79:30303", "enode://2b252ab6a1d0f971d9722cb839a42cb81db019ba44c08754628ab4a823487071b5695317c8ccd085219c3a03af063495b2f1da8d18218da2d6a82981b45e6ffc@65.108.70.101:30303", "enode://4aeb4ab6c14b23e2c4cfdce879c04b0748a20d8e9b59e25ded2a08143e265c6c25936e74cbc8e641e3312ca288673d91f2f93f8e277de3cfa444ecdaaf982052@157.90.35.166:30303"]
BootstrapNodesV5 = ["enr:-KG4QOtcP9X1FbIMOe17QNMKqDxCpm14jcX5tiOE4_TyMrFqbmhPZHK_ZPG2Gxb1GE2xdtodOfx9-cgvNtxnRyHEmC0ghGV0aDKQ9aX9QgAAAAD__________4JpZIJ2NIJpcIQDE8KdiXNlY3AyNTZrMaEDhpehBDbZjM_L9ek699Y7vhUJ-eAdMyQW_Fil522Y0fODdGNwgiMog3VkcIIjKA", "enr:-KG4QDyytgmE4f7AnvW-ZaUOIi9i79qX4JwjRAiXBZCU65wOfBu-3Nb5I7b_Rmg3KCOcZM_C3y5pg7EBU5XGrcLTduQEhGV0aDKQ9aX9QgAAAAD__________4JpZIJ2NIJpcIQ2_DUbiXNlY3AyNTZrMaEDKnz_-ps3UUOfHWVYaskI5kWYO_vtYMGYCQRAR3gHDouDdGNwgiMog3VkcIIjKA", "enr:-Ku4QImhMc1z8yCiNJ1TyUxdcfNucje3BGwEHzodEZUan8PherEo4sF7pPHPSIB1NNuSg5fZy7qFsjmUKs2ea1Whi0EBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpD1pf1CAAAAAP__________gmlkgnY0gmlwhBLf22SJc2VjcDI1NmsxoQOVphkDqal4QzPMksc5wnpuC3gvSC8AfbFOnZY_On34wIN1ZHCCIyg", "enr:-Ku4QP2xDnEtUXIjzJ_DhlCRN9SN99RYQPJL92TMlSv7U5C1YnYLjwOQHgZIUXw6c-BvRg2Yc2QsZxxoS_pPRVe0yK8Bh2F0dG5ldHOIAAAAAAAAAACEZXRoMpD1pf1CAAAAAP__________gmlkgnY0gmlwhBLf22SJc2VjcDI1NmsxoQMeFF5GrS7UZpAH2Ly84aLK-TyvH-dRo0JM1i8yygH50YN1ZHCCJxA", "enr:-Ku4QPp9z1W4tAO8Ber_NQierYaOStqhDqQdOPY3bB3jDgkjcbk6YrEnVYIiCBbTxuar3CzS528d2iE7TdJsrL-dEKoBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpD1pf1CAAAAAP__________gmlkgnY0gmlwhBLf22SJc2VjcDI1NmsxoQMw5fqqkw2hHC4F5HZZDPsNmPdB1Gi8JPQK7pRc9XHh-oN1ZHCCKvg", "enr:-IS4QLkKqDMy_ExrpOEWa59NiClemOnor-krjp4qoeZwIw2QduPC-q7Kz4u1IOWf3DDbdxqQIgC4fejavBOuUPy-HE4BgmlkgnY0gmlwhCLzAHqJc2VjcDI1NmsxoQLQSJfEAHZApkm5edTCZ_4qps_1k_ub2CxHFxi-gr2JMIN1ZHCCIyg", "enr:-IS4QDAyibHCzYZmIYZCjXwU9BqpotWmv2BsFlIq1V31BwDDMJPFEbox1ijT5c2Ou3kvieOKejxuaCqIcjxBjJ_3j_cBgmlkgnY0gmlwhAMaHiCJc2VjcDI1NmsxoQJIdpj_foZ02MXz4It8xKD7yUHTBx7lVFn3oeRP21KRV4N1ZHCCIyg", "enr:-Ku4QHqVeJ8PPICcWk1vSn_XcSkjOkNiTg6Fmii5j6vUQgvzMc9L1goFnLKgXqBJspJjIsB91LTOleFmyWWrFVATGngBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpC1MD8qAAAAAP__________gmlkgnY0gmlwhAMRHkWJc2VjcDI1NmsxoQKLVXFOhp2uX6jeT0DvvDpPcU8FWMjQdR4wMuORMhpX24N1ZHCCIyg", "enr:-Ku4QG-2_Md3sZIAUebGYT6g0SMskIml77l6yR-M_JXc-UdNHCmHQeOiMLbylPejyJsdAPsTHJyjJB2sYGDLe0dn8uYBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpC1MD8qAAAAAP__________gmlkgnY0gmlwhBLY-NyJc2VjcDI1NmsxoQORcM6e19T1T9gi7jxEZjk_sjVLGFscUNqAY9obgZaxbIN1ZHCCIyg", "enr:-Ku4QPn5eVhcoF1opaFEvg1b6JNFD2rqVkHQ8HApOKK61OIcIXD127bKWgAtbwI7pnxx6cDyk_nI88TrZKQaGMZj0q0Bh2F0dG5ldHOIAAAAAAAAAACEZXRoMpC1MD8qAAAAAP__________gmlkgnY0gmlwhDayLMaJc2VjcDI1NmsxoQK2sBOLGcUb4AwuYzFuAVCaNHA-dy24UuEKkeFNgCVCsIN1ZHCCIyg", "enr:-Ku4QEWzdnVtXc2Q0ZVigfCGggOVB2Vc1ZCPEc6j21NIFLODSJbvNaef1g4PxhPwl_3kax86YPheFUSLXPRs98vvYsoBh2F0dG5ldHOIAAAAAAAAAACEZXRoMpC1MD8qAAAAAP__________gmlkgnY0gmlwhDZBrP2Jc2VjcDI1NmsxoQM6jr8Rb1ktLEsVcKAPa08wCsKUmvoQ8khiOl_SLozf9IN1ZHCCIyg"]
StaticNodes = []
TrustedNodes = []
ListenAddr = ":30303"
DiscAddr = ""
EnableMsgEvents = false

[Node.HTTPTimeouts]
ReadTimeout = 120000000000
ReadHeaderTimeout = 30000000000
WriteTimeout = 120000000000
IdleTimeout = 120000000000
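
Note that with HTTPHost = "0.0.0.0", WSHost = "0.0.0.0" and HTTPVirtualHosts = ["*"] above, the JSON-RPC endpoints answer on every interface. A minimal reachability check from another machine (NODE_IP is a placeholder for the node's public address; this only verifies exposure and is not part of the original report):

curl -s -X POST http://NODE_IP:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"web3_clientVersion","params":[]}'

Any JSON response here means the RPC is open to whoever sent the request.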

Backtrace

Many of my nodes went OOM between Sun Jul 21 03:40:00 2024 (UTC+8) and Sun Jul 21 04:20:00 2024 (UTC+8), with the following node logs:

WARN [07-21|04:17:01.748] Served eth_call                          conn=172.31.239.100:52124 reqid=11 duration="795.22µs"  err="execution reverted: arithmetic underflow or overflow" errdata=0x4e487b710000000000000000000000000000000000000000000000000000000000000011
WARN [07-21|04:17:01.748] Served eth_call                          conn=172.31.239.100:52124 reqid=12 duration="419.286µs" err="execution reverted: arithmetic underflow or overflow" errdata=0x4e487b710000000000000000000000000000000000000000000000000000000000000011
WARN [07-21|04:17:01.749] Served eth_call                          conn=172.31.239.100:52124 reqid=13 duration="341.878µs" err="execution reverted: arithmetic underflow or overflow" errdata=0x4e487b710000000000000000000000000000000000000000000000000000000000000011
WARN [07-21|04:17:01.749] Served eth_call                          conn=172.31.239.100:52124 reqid=14 duration="286.569µs" err="execution reverted: arithmetic underflow or overflow" errdata=0x4e487b710000000000000000000000000000000000000000000000000000000000000011
WARN [07-21|04:17:01.749] Served eth_call                          conn=172.31.239.100:52124 reqid=15 duration="334.437µs" err="execution reverted: arithmetic underflow or overflow" errdata=0x4e487b710000000000000000000000000000000000000000000000000000000000000011
WARN [07-21|04:17:05.650] Served eth_call                          conn=172.31.239.100:35894 reqid=5167 duration="421.604µs" err="execution reverted: Multicall3: call failed" errdata=0x08c379a0000000000000000000000000000000000000000000000000000000000000002000000000000000000000000000000000000000000000000000000000000000174d756c746963616c6c333a2063616c6c206661696c6564000000000000000000
WARN [07-21|04:17:08.215] Served eth_call                          conn=172.31.119.40:45030 reqid=1 duration=162.91565ms err="execution reverted"
WARN [07-21|04:17:11.382] Served eth_call                          conn=172.31.119.40:45738 reqid=1 duration=424.717335ms err="execution reverted"
/opt/ethmain/supervisor/node_command.sh: line 84:   248 Killed                  ${COMMAND}
[2024-07-21 04:18:14] [node_command.sh] exec command [/opt/ethmain/core/geth --config=/mnt/ethmain/conf/config.toml --rpc.gascap=0 --rpc.txfeecap=0]
INFO [07-21|04:18:15.331] Starting Geth on Ethereum mainnet...
INFO [07-21|04:18:15.332] Bumping default cache on mainnet         provided=1024 updated=4096
INFO [07-21|04:18:15.347] Maximum peer count                       ETH=50 total=50
INFO [07-21|04:18:15.351] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
INFO [07-21|04:18:15.359] Enabling recording of key preimages since archive mode is used
INFO [07-21|04:18:15.360] Global gas cap disabled
INFO [07-21|04:18:15.360] Initializing the KZG library             backend=gokzg
INFO [07-21|04:18:15.557] Allocated trie memory caches             clean=1.20GiB dirty=0.00B
INFO [07-21|04:18:16.761] Using pebble as the backing database
INFO [07-21|04:18:16.761] Allocated cache and file handles         database=/mnt/ethmain/node/geth/geth/chaindata cache=2.00GiB handles=524,288
INFO [07-21|04:18:24.983] Opened ancient database                  database=/mnt/ethmain/node/geth/geth/chaindata/ancient/chain readonly=false
INFO [07-21|04:18:24.991] State scheme set by user                 scheme=path
INFO [07-21|04:18:25.018] Initialising Ethereum protocol           network=1 dbversion=8
INFO [07-21|04:18:26.266] Failed to load journal, discard it       err="unmatched journal want 3e6e1f70b7f00f15368c7084e7a11ec6476354855bcae3132bcdef4d77de8bb9 got 4e7307f51185c98d7d7b97f84de2e6b22ec9ac731d149db361eeda4dfd408240"
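
The supervisor's "Killed" line (termination by signal, not a geth crash) is consistent with the kernel OOM killer. A quick way to confirm on the host, assuming kernel logs from that window are still retained:

dmesg -T | grep -i -E 'out of memory|oom-kill'
journalctl -k --since "2024-07-21 03:30" | grep -i oom

An oom-kill entry naming the geth process would confirm that the kernel, not geth itself, terminated the node.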
s1na commented 3 months ago

So here are the facts: your node's RPC server is exposed publicly, and the gas cap is set to unlimited, so anyone can send arbitrarily complex eth_call requests to your node. Your admin and personal namespaces are exposed as well. Very bad: it makes it very easy for someone to take down your RPC server.

Either you or someone else has been sending what appear to be many parallel requests to the node, some of them heavy (e.g. the last one took over 400ms). Your setup is asking for it.

However, if you have access to the request bodies, I can investigate whether anything in them could cause a memory blowup.
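
For reference, a hedged sketch of a more defensive setup (values are illustrative, not an official recommendation): bind the HTTP and WS servers to localhost or a private interface, expose only the namespaces the validator actually needs, and keep a finite gas cap (geth's built-in default is 50,000,000; likewise, dropping --rpc.txfeecap=0 restores the default 1-ether fee cap):

[Node]
HTTPHost = "127.0.0.1"
HTTPModules = ["eth", "net", "web3"]
WSHost = "127.0.0.1"
WSModules = ["eth", "net", "web3"]

/opt/ethmain/core/geth --config=/mnt/ethmain/conf/config.toml --rpc.gascap=50000000

If the RPC must stay reachable from other hosts, a firewall rule or an authenticating reverse proxy in front of port 8545 achieves the same effect.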

holiman commented 3 months ago

Answered by @s1na, closing.