try rm -rf txpool
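Spelled out, that suggestion is roughly the following (a sketch only; the service name and data directory are the ones that appear later in this thread):

```bash
# Stop Erigon first so the txpool database is not in use, then delete it.
# Erigon recreates an empty txpool DB on the next start.
sudo systemctl stop ethereum
rm -rf /ethereum/data/erigon/txpool
sudo systemctl start ethereum
```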
@AskAlexSharov got a new error now, this time with chaindata. Detailing the steps below because it feels weird:
```bash
[INFO] [09-03|11:25:08.206] Got interrupt, shutting down... sig=terminated
[INFO] [09-03|11:25:08.207] Got interrupt, shutting down...
[INFO] [09-03|11:25:08.211] Exiting...
[INFO] [09-03|11:25:08.211] Exiting Engine...
[INFO] [09-03|11:25:08.211] RPC server shutting down
[INFO] [09-03|11:25:08.213] RPC server shutting down
[INFO] [09-03|11:25:08.213] RPC server shutting down
[INFO] [09-03|11:25:08.213] Engine HTTP endpoint close url=127.0.0.1:8551
ethereum.service: Deactivated successfully.
Stopped ethereum.service - Erigon ethereum Node_TT.
ethereum.service: Consumed 21h 1min 46.685s CPU time.
```
```bash
NAME AVAIL USED REFER MOUNTPOINT TYPE CREATION
tank/ethereum/data/erigon@before_txpool_rm - 0B 1.60T - snapshot Tue Sep 3 11:25 2024
```
Removed `txpool` and restarted:
```bash
[INFO] [09-03|11:26:55.347] Reading JWT secret path=/ethereum/data/erigon/jwt.hex
[INFO] [09-03|11:26:55.353] HTTP endpoint opened for Engine API url=127.0.0.1:8551 ws=true ws.compression=true
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[WARN] [09-03|11:26:55.369] NAT ExternalIP resolution has failed, try to pass a different --nat option err="no UPnP or NAT-PMP router discovered"
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
[WARN] [09-03|11:26:55.372] NAT ExternalIP resolution has failed, try to pass a different --nat option err="no UPnP or NAT-PMP router discovered"
[INFO] [09-03|11:26:55.377] Started P2P networking version=67 self=enode://de25cf66d74d26fcd9222c65189f9facd0234d6d37e94ac503d4e9524f90ea55b1bfd3c0b68cd11d3e746b77a6efa0068590832d13a32aea554ed240af23de27@127.0.0.1:30304 name=erigon/v2.60.6-d24e5d45/linux-amd64/go1.21.13
[INFO] [09-03|11:26:55.375] [1/12 Snapshots] Requesting downloads
[INFO] [09-03|11:26:55.379] Started P2P networking version=68 self=enode://de25cf66d74d26fcd9222c65189f9facd0234d6d37e94ac503d4e9524f90ea55b1bfd3c0b68cd11d3e746b77a6efa0068590832d13a32aea554ed240af23de27@127.0.0.1:30303 name=erigon/v2.60.6-d24e5d45/linux-amd64/go1.21.13
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
page_alloc_slowpath:10501 unable alloc 1 pages, flags 0x7, errcode -30796
```
Rolled back to the snapshot to check the `txpool` contents and double-check the behaviour before concluding it didn't work:
```bash
root@chicago-3:/ethereum/data/erigon# zfs rollback tank/ethereum/data/erigon@before_txpool_rm
root@chicago-3:/ethereum/data/erigon# ls -la txpool
total 25
drwxr--r-- 2 ethereum ethereum 4 Dec 8 2022 .
drwxr-xr-x 12 ethereum ethereum 15 Aug 30 11:58 ..
-rw-r--r-- 1 ethereum ethereum 16777216 Dec 8 2022 mdbx.dat
-rw-r--r-- 1 ethereum ethereum 0 Dec 8 2022 mdbx.lck
root@chicago-3:/ethereum/data/erigon# rm -rf txpool
root@chicago-3:/ethereum/data/erigon# systemctl start ethereum
root@chicago-3:/ethereum/data/erigon# ls -la txpool/
total 13
drwxr--r-- 2 ethereum ethereum 2 Sep 3 11:32 .
drwxr-xr-x 12 ethereum ethereum 15 Sep 3 11:32 ..
```
And now the node keeps failing with a chaindata MDBX error:
``` bash
[INFO] [09-03|11:34:24.195] Opening Database label=chaindata path=/ethereum/data/erigon/chaindata
meta_checktxnid:11400 catch invalid root_page_txnid 55694987 for freedb.mod_txnid 55726472 (workaround for incoherent flaw of unified page/buffer cache)
meta_checktxnid:11415 catch invalid root_page_txnid 55707804 for maindb.mod_txnid 55726472 (workaround for incoherent flaw of unified page/buffer cache)
meta_waittxnid:11454 bailout waiting for valid snapshot (workaround for incoherent flaw of unified page/buffer cache)
[EROR] [09-03|11:34:24.197] Erigon startup err="mdbx_txn_begin: MDBX_CORRUPTED: Maybe free space is over on disk. Otherwise it's hardware failure. Before creating issue please use tools like https://www.memtest86.com to test RAM and tools like https://www.smartmontools.org to test Disk. To handle hardware risks: use ECC RAM, use RAID of disks, run multiple application instances (or do backups). If hardware checks passed - check FS settings - 'fsync' and 'flock' must be enabled. Otherwise - please create issue in Application repo. On default DURABLE mode, power outage can't cause this error. On other modes - power outage may break last transaction and mdbx_chk can recover db in this case, see '-t' and '-0|1|2' options., label: chaindata, trace: [kv_mdbx.go:363 node.go:367 node.go:370 backend.go:246 node.go:124 main.go:66 make_app.go:54 command.go:276 app.go:333 app.go:307 main.go:34 proc.go:267 asm_amd64.s:1650]"
mdbx_txn_begin: MDBX_CORRUPTED: Maybe free space is over on disk. Otherwise it's hardware failure. Before creating issue please use tools like https://www.memtest86.com to test RAM and tools like https://www.smartmontools.org to test Disk. To handle hardware risks: use ECC RAM, use RAID of disks, run multiple application instances (or do backups). If hardware checks passed - check FS settings - 'fsync' and 'flock' must be enabled. Otherwise - please create issue in Application repo. On default DURABLE mode, power outage can't cause this error. On other modes - power outage may break last transaction and mdbx_chk can recover db in this case, see '-t' and '-0|1|2' options., label: chaindata, trace: [kv_mdbx.go:363 node.go:367 node.go:370 backend.go:246 node.go:124 main.go:66 make_app.go:54 command.go:276 app.go:333 app.go:307 main.go:34 proc.go:267 asm_amd64.s:1650]
```

Additional:

```bash
ubuntu@chicago-3:~$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
tank/ethereum 2.19T 20.3T 1.88G /ethereum
tank/ethereum/data 2.19T 20.3T 151K /ethereum/data
tank/ethereum/data/erigon 2.10T 20.3T 1.60T /ethereum/data/erigon
tank/ethereum/data/lighthouse 87.3G 20.3T 87.3G /ethereum/data/lighthouse
```
For comparison, the same snapshot / rm / rollback sequence on a Gnosis node on the same host:

```bash
ubuntu@chicago-3:/gnosis$ sudo systemctl stop gnosis
ubuntu@chicago-3:/gnosis$ sudo systemctl stop gnosis-lighthouse
ubuntu@chicago-3:/gnosis$ sudo zfs snapshot tank/gnosis/data/erigon@testing_snapshot
ubuntu@chicago-3:/gnosis$ sudo rm -rf data/erigon/txpool
ubuntu@chicago-3:/gnosis$ sudo systemctl start gnosis
ubuntu@chicago-3:/gnosis$ sudo ls -latrh data/erigon/txpool
total 13K
drwxr-xr-x 12 gnosis gnosis 15 Sep 3 12:10 ..
drwxr--r-- 2 gnosis gnosis 2 Sep 3 12:10 .
ubuntu@chicago-3:/gnosis$ sudo systemctl stop gnosis
ubuntu@chicago-3:/gnosis$ zfs list -t snapshot | grep testing
tank/gnosis/data/erigon@testing_snapshot - 31.2M 612G - snapshot Tue Sep 3 12:10 2024
ubuntu@chicago-3:/gnosis$ sudo zfs rollback tank/gnosis/data/erigon@testing_snapshot
ubuntu@chicago-3:/gnosis$ zfs list -t snapshot | grep testing
tank/gnosis/data/erigon@testing_snapshot - 0B 612G - snapshot Tue Sep 3 12:10 2024
ubuntu@chicago-3:/gnosis$ sudo systemctl start gnosis
ubuntu@chicago-3:/gnosis$ zfs list -t snapshot | grep testing
tank/gnosis/data/erigon@testing_snapshot - 21.5M 612G - snapshot Tue Sep 3 12:10 2024
```
Any ideas?
Check `fsync` on your FS. The `chaindata` error message covers the rest:

> please use tools like https://www.memtest86.com to test RAM and tools like https://www.smartmontools.org to test Disk. To handle hardware risks: use ECC RAM, use RAID of disks, run multiple application instances (or do backups).

> mdbx_chk can recover db in this case, see '-t' and '-0|1|2' options
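On ZFS the property to look at is `sync`; a minimal check, assuming the dataset names shown in the `zfs list` output above:

```bash
# Show the effective 'sync' value and where it is inherited from.
# sync=disabled means fsync() returns before data reaches stable storage,
# which MDBX's durability guarantees rely on.
zfs get -r sync tank/ethereum

# If it reports 'disabled', restore the default POSIX-compliant behaviour:
sudo zfs set sync=standard tank/ethereum
```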
Damn, yeah, ZFS `sync` is disabled, sorry for disbelieving that - it's an inherited flag that just hadn't proven problematic for long enough. I guess all bets are off then, thank you!

Is it safe to assume that the originating node, which ran with sync disabled but never threw errors, is healthy, or should I re-sync it to be sure?
High chance that it's healthy - absence of fsync is not dangerous until a server power outage. You can run mdbx_chk, but it will take a long time.
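A rough sketch of that check, assuming `mdbx_chk` is available on the host (e.g. from a libmdbx build or Erigon's db-tools make target) and using the paths from this report:

```bash
# Stop the node first so the checker can open the database exclusively.
sudo systemctl stop ethereum

# Walk the whole chaindata environment and report inconsistencies (-v = verbose).
# On a ~1.6T database this will run for a long time.
mdbx_chk -v /ethereum/data/erigon/chaindata

sudo systemctl start ethereum
```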
Great, thank you!
System information
Erigon version: 2.60.6-d24e5d45
OS & Version: Linux, Ubuntu 24.04, 6.8.0-40-generic
Commit hash: d24e5d45755d7b23075c507ad9216e1d60ad03de
Erigon Command (with flags/config):
Consensus Layer: Lighthouse 5.3.0
Consensus Layer Command (with flags/config):
Chain/Network: Ethereum Mainnet
Expected behaviour
Node to sync without warnings or errors after moving to a new machine running Erigon v2.60.6, using a DB snapshot taken with Erigon v2.58.1.
Actual behaviour
Node caught up normally and continues to sync, but spams errors:
This is what start-up and the first occurrence of the error look like:
Same behaviour whether the CL (Lighthouse 5.3.0) is running or stopped.
Please advise.