Node syncs but reports page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796

ttibord commented 1 month ago

System information

Erigon version: 2.60.6-d24e5d45 OS & Version: Linux, Ubuntu 24.04, 6.8.0-40-generic Commit hash: d24e5d45755d7b23075c507ad9216e1d60ad03de Erigon Command (with flags/config):

# CMD: exec /ethereum/erigon/build/dist/erigon --config /ethereum/erigon-ethereum.toml
# CFG: erigon-ethereum.toml
chain="mainnet"
datadir = "/ethereum/data/erigon"
snapshots = true
"txpool.disable" = true
"db.pagesize" = "4k"
"db.size.limit" = "4TB"
port = 30303
"p2p.allowed-ports" = [30303, 30304, 30305, 30306, 30307]
"authrpc.port" = 8551
http = true
"http.port" = 8545
"http.addr" = "0.0.0.0"
"http.api" = "eth,erigon,web3,net,trace,debug"
"http.corsdomain" = "*"
"http.vhosts" = "*"
metrics = true
"metrics.port" = 6065
"metrics.addr" = "0.0.0.0"
"private.api.addr" = "127.0.0.1:9095"
"torrent.port" = 42069
"torrent.download.rate" = "512mb"
"torrent.upload.rate" = "512mb"

Consensus Layer: Lighthouse 5.3.0 Consensus Layer Command (with flags/config):

exec /ethereum/lighthouse/target/release/lighthouse --network mainnet beacon \
  --checkpoint-sync-url=https://mainnet.checkpoint.sigp.io \
  --datadir=/ethereum/data/lighthouse \
  --execution-endpoint=http://localhost:8551 \
  --execution-jwt=/ethereum/data/erigon/jwt.hex \
  --reconstruct-historic-states \
  --metrics --metrics-address=0.0.0.0 --metrics-port=6061 \
  --http --http-address=0.0.0.0 --http-port=5052 \
  --port 9000 \
  --prune-blobs=false \
  --prune-payloads=false

Chain/Network: Ethereum Mainnet

Expected behaviour

Node to sync without warnings or errors, after running Erigon v2.60.6 on a new machine using DB snapshot of Erigon v2.58.1.

Actual behaviour

Node caught up normally, and continues to sync but spams errors:

page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[INFO] [09-02|13:00:56.328] [NewPayload] Handling new payload        height=20663037 hash=0x14668121ade1811fb7921111d78a990f0848c0ac58f5e23046b93b6f4da40eef
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[INFO] [09-02|13:00:58.126] [updateForkchoice] Fork choice update: flushing in-memory state (built by previous newPayload)
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796

This is what start-up and first occurrence of error looks like:

[INFO] [09-02|14:26:50.563] [db] open                                label=chaindata sizeLimit=4TB pageSize=4096
[INFO] [09-02|14:26:50.567] Initialised chain configuration          config="{ChainID: 1, Homestead: 1150000, DAO: 1920000, Tangerine Whistle: 2463000, Spurious Dragon: 2675000, Byzantium: 4370000, Constantinople: 7280000, Petersburg: 7280000, Istanbul: 9069000, Muir Glacier: 9200000, Berlin: 12244000, London: 12965000, Arrow Glacier: 13773000, Gray Glacier: 15050000, Terminal Total Difficulty: 58750000000000000000000, Merge Netsplit: <nil>, Shanghai: 1681338455, Cancun: 1710338135, Prague: <nil>, Osaka: <nil>, Engine: ethash, NoPruneContracts: map[0x00000000219ab540356cBB839Cbe05303d7705Fa:true]}" genesis=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3
[INFO] [09-02|14:27:35.110] Initialising Ethereum protocol           network=1
[INFO] [09-02|14:27:35.111] Disk storage enabled for ethash DAGs     dir=/ethereum/data/erigon/ethash-dags count=2
[INFO] [09-02|14:27:36.934] Starting private RPC server              on=127.0.0.1:9095
[INFO] [09-02|14:27:36.934] new subscription to logs established
[INFO] [09-02|14:27:36.935] rpc filters: subscribing to Erigon events
[INFO] [09-02|14:27:36.935] new subscription to newHeaders established
[EROR] [09-02|14:27:36.935] [Diagnostics] Failure in running diagnostics server err="listen tcp 0.0.0.0:6060: bind: address already in use"
[INFO] [09-02|14:27:36.936] Reading JWT secret                       path=/ethereum/data/erigon/jwt.hex
[INFO] [09-02|14:27:36.936] JsonRpc endpoint opened                  ws=false ws.compression=true grpc=false http.url=[::]:8545
[INFO] [09-02|14:27:36.937] HTTP endpoint opened for Engine API      url=127.0.0.1:8551 ws=true ws.compression=true
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[WARN] [09-02|14:27:36.951] NAT ExternalIP resolution has failed, try to pass a different --nat option err="no UPnP or NAT-PMP router discovered"
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[INFO] [09-02|14:27:36.955] Started P2P networking                   version=68 self=enode://de25cf66d74d26fcd9222c65189f9facd0234d6d37e94ac503d4e9524f90ea55b1bfd3c0b68cd11d3e746b77a6efa0068590832d13a32aea554ed240af23de27@127.0.0.1:30303 name=erigon/v2.60.6-d24e5d45/linux-amd64/go1.21.13
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[WARN] [09-02|14:27:36.956] NAT ExternalIP resolution has failed, try to pass a different --nat option err="no UPnP or NAT-PMP router discovered"
[INFO] [09-02|14:27:36.956] [1/12 Snapshots] Requesting downloads
[INFO] [09-02|14:27:36.964] Started P2P networking                   version=67 self=enode://de25cf66d74d26fcd9222c65189f9facd0234d6d37e94ac503d4e9524f90ea55b1bfd3c0b68cd11d3e746b77a6efa0068590832d13a32aea554ed240af23de27@127.0.0.1:30304 name=erigon/v2.60.6-d24e5d45/linux-amd64/go1.21.13
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796

Same behaviour when CL(Lighthouse 5.3.0) is running and stopped.

Please advise.

AskAlexSharov commented 1 month ago

try rm -rf txpool

ttibord commented 1 month ago

@AskAlexSharov got a new error now, this time with chaindata - detailing steps below because it feels weird:

stopped node, seems graceful:

[INFO] [09-03|11:25:08.206] Got interrupt, shutting down...          sig=terminated
[INFO] [09-03|11:25:08.207] Got interrupt, shutting down...
[INFO] [09-03|11:25:08.211] Exiting...
[INFO] [09-03|11:25:08.211] Exiting Engine...
[INFO] [09-03|11:25:08.211] RPC server shutting down
[INFO] [09-03|11:25:08.213] RPC server shutting down
[INFO] [09-03|11:25:08.213] RPC server shutting down
[INFO] [09-03|11:25:08.213] Engine HTTP endpoint close               url=127.0.0.1:8551
ethereum.service: Deactivated successfully.
Stopped ethereum.service - Erigon ethereum Node_TT.
ethereum.service: Consumed 21h 1min 46.685s CPU time.

created ZFS snapshot of the node DB

NAME                                                    AVAIL   USED  REFER  MOUNTPOINT  TYPE      CREATION
tank/ethereum/data/erigon@before_txpool_rm                  -     0B  1.60T  -           snapshot  Tue Sep  3 11:25 2024

deleted txpool

started node again - exactly same start-up as above

[INFO] [09-03|11:26:55.347] Reading JWT secret                       path=/ethereum/data/erigon/jwt.hex
[INFO] [09-03|11:26:55.353] HTTP endpoint opened for Engine API      url=127.0.0.1:8551 ws=true ws.compression=true
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[WARN] [09-03|11:26:55.369] NAT ExternalIP resolution has failed, try to pass a different --nat option err="no UPnP or NAT-PMP router discovered"
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[WARN] [09-03|11:26:55.372] NAT ExternalIP resolution has failed, try to pass a different --nat option err="no UPnP or NAT-PMP router discovered"
[INFO] [09-03|11:26:55.377] Started P2P networking                   version=67 self=enode://de25cf66d74d26fcd9222c65189f9facd0234d6d37e94ac503d4e9524f90ea55b1bfd3c0b68cd11d3e746b77a6efa0068590832d13a32aea554ed240af23de27@127.0.0.1:30304 name=erigon/v2.60.6-d24e5d45/linux-amd64/go1.21.13
[INFO] [09-03|11:26:55.375] [1/12 Snapshots] Requesting downloads
[INFO] [09-03|11:26:55.379] Started P2P networking                   version=68 self=enode://de25cf66d74d26fcd9222c65189f9facd0234d6d37e94ac503d4e9524f90ea55b1bfd3c0b68cd11d3e746b77a6efa0068590832d13a32aea554ed240af23de27@127.0.0.1:30303 name=erigon/v2.60.6-d24e5d45/linux-amd64/go1.21.13
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796

proceeded to rollback and repeat the procedure - so I can paste the previous txpool contents and double check behaviour before concluding it didn't work:
```
root@chicago-3:/ethereum/data/erigon# zfs rollback tank/ethereum/data/erigon@before_txpool_rm
```

root@chicago-3:/ethereum/data/erigon# ls -la txpool total 25 drwxr--r-- 2 ethereum ethereum 4 Dec 8 2022 . drwxr-xr-x 12 ethereum ethereum 15 Aug 30 11:58 .. -rw-r--r-- 1 ethereum ethereum 16777216 Dec 8 2022 mdbx.dat -rw-r--r-- 1 ethereum ethereum 0 Dec 8 2022 mdbx.lck

root@chicago-3:/ethereum/data/erigon# rm -rf txpool root@chicago-3:/ethereum/data/erigon# systemctl start ethereum

root@chicago-3:/ethereum/data/erigon# ls -la txpool/ total 13 drwxr--r-- 2 ethereum ethereum 2 Sep 3 11:32 . drwxr-xr-x 12 ethereum ethereum 15 Sep 3 11:32 ..


And now the node keeps failing after a chaindata MDBX error:
``` bash
[INFO] [09-03|11:34:24.195] Opening Database                         label=chaindata path=/ethereum/data/erigon/chaindata
meta_checktxnid:11400 catch invalid root_page_txnid 55694987 for freedb.mod_txnid 55726472 (workaround for incoherent flaw of unified page/buffer cache)
meta_checktxnid:11415 catch invalid root_page_txnid 55707804 for maindb.mod_txnid 55726472 (workaround for incoherent flaw of unified page/buffer cache)
meta_waittxnid:11454 bailout waiting for valid snapshot (workaround for incoherent flaw of unified page/buffer cache)
[EROR] [09-03|11:34:24.197] Erigon startup                           err="mdbx_txn_begin: MDBX_CORRUPTED: Maybe free space is over on disk. Otherwise it's hardware failure. Before creating issue please use tools like https://www.memtest86.com to test RAM and tools like https://www.smartmontools.org to test Disk. To handle hardware risks: use ECC RAM, use RAID of disks, run multiple application instances (or do backups). If hardware checks passed - check FS settings - 'fsync' and 'flock' must be enabled.  Otherwise - please create issue in Application repo. On default DURABLE mode, power outage can't cause this error. On other modes - power outage may break last transaction and mdbx_chk can recover db in this case, see '-t' and '-0|1|2' options., label: chaindata, trace: [kv_mdbx.go:363 node.go:367 node.go:370 backend.go:246 node.go:124 main.go:66 make_app.go:54 command.go:276 app.go:333 app.go:307 main.go:34 proc.go:267 asm_amd64.s:1650]"
mdbx_txn_begin: MDBX_CORRUPTED: Maybe free space is over on disk. Otherwise it's hardware failure. Before creating issue please use tools like https://www.memtest86.com to test RAM and tools like https://www.smartmontools.org to test Disk. To handle hardware risks: use ECC RAM, use RAID of disks, run multiple application instances (or do backups). If hardware checks passed - check FS settings - 'fsync' and 'flock' must be enabled.  Otherwise - please create issue in Application repo. On default DURABLE mode, power outage can't cause this error. On other modes - power outage may break last transaction and mdbx_chk can recover db in this case, see '-t' and '-0|1|2' options., label: chaindata, trace: [kv_mdbx.go:363 node.go:367 node.go:370 backend.go:246 node.go:124 main.go:66 make_app.go:54 command.go:276 app.go:333 app.go:307 main.go:34 proc.go:267 asm_amd64.s:1650]

Additional:

tried rolling back a couple of more times, checking there are no diffs from the last time node worked
ZFS reports no errors on the pool

there is plenty of available space:

ubuntu@chicago-3:~$ zfs list
NAME                            USED  AVAIL  REFER  MOUNTPOINT
tank/ethereum                  2.19T  20.3T  1.88G  /ethereum
tank/ethereum/data             2.19T  20.3T   151K  /ethereum/data
tank/ethereum/data/erigon      2.10T  20.3T  1.60T  /ethereum/data/erigon
tank/ethereum/data/lighthouse  87.3G  20.3T  87.3G  /ethereum/data/lighthouse

tried testing the exact same flow on a Gnosis node on the same machine and pool, running same Erigon version, 2.60.6, and it resumed operations normally and has synced since:

ubuntu@chicago-3:/gnosis$ sudo systemctl stop gnosis
ubuntu@chicago-3:/gnosis$ sudo systemctl stop gnosis-lighthouse
ubuntu@chicago-3:/gnosis$ sudo zfs snapshot tank/gnosis/data/erigon@testing_snapshot
ubuntu@chicago-3:/gnosis$ sudo rm -rf data/erigon/txpool
ubuntu@chicago-3:/gnosis$ sudo systemctl start gnosis
ubuntu@chicago-3:/gnosis$ sudo ls -latrh data/erigon/txpool
total 13K
drwxr-xr-x 12 gnosis gnosis 15 Sep  3 12:10 ..
drwxr--r--  2 gnosis gnosis  2 Sep  3 12:10 .
ubuntu@chicago-3:/gnosis$ sudo systemctl stop gnosis
ubuntu@chicago-3:/gnosis$ zfs list -t snapshot | grep testing
tank/gnosis/data/erigon@testing_snapshot                    -  31.2M   612G  -           snapshot  Tue Sep  3 12:10 2024
ubuntu@chicago-3:/gnosis$ sudo zfs rollback tank/gnosis/data/erigon@testing_snapshot
ubuntu@chicago-3:/gnosis$ zfs list -t snapshot | grep testing
tank/gnosis/data/erigon@testing_snapshot                    -     0B   612G  -           snapshot  Tue Sep  3 12:10 2024
ubuntu@chicago-3:/gnosis$ sudo systemctl start gnosis
ubuntu@chicago-3:/gnosis$ zfs list -t snapshot | grep testing
tank/gnosis/data/erigon@testing_snapshot                    -  21.5M   612G  -           snapshot  Tue Sep  3 12:10 2024

Any ideas?

AskAlexSharov commented 1 month ago

maybe you disabled fsync
maybe you did non-consistent backup of chaindata

maybe you have hardware issue. follow advise:

please use tools like https://www.memtest86.com to test RAM and tools like https://www.smartmontools.org to test Disk. To handle hardware risks: use ECC RAM, use RAID of disks, run multiple application instances (or do backups).

you also can try: mdbx_chk can recover db in this case, see '-t' and '-0|1|2' options

ttibord commented 1 month ago

Damn, yeah, ZFS sync is disabled, sorry for disbelieving that - inherited flag, that just didn't prove problematic for long enough. I guess all bets are off then, thank you!

Is it safe to assume that the originating node running with sync disabled but not throwing errors is healthy, or should I re-sync it to be sure?

AskAlexSharov commented 1 month ago

high chance that it's healthy - absence of fsync is not dangerous until server-power-outage. can run mdbx_chk but it will take long time.

ttibord commented 1 month ago

Great, thank you!

erigontech / erigon