erigontech / erigon

Ethereum implementation on the efficiency frontier https://erigon.gitbook.io
GNU Lesser General Public License v3.0

Execution failure on block 5743045 #11044

Open suxnju opened 2 months ago

suxnju commented 2 months ago

System information Erigon version: erigon version 2.60.2-2f41075a OS & Version: Ubuntu 20.04.6

Expected behaviour Everything works :)

Actual behaviour Execution failed on block 5743045

Steps to reproduce the behaviour I synced erigon on ethereum mainnet via

erigon \
    --datadir "$DATADIR" \
    --chain mainnet \
    --port=30303 --http.port=8545 --authrpc.port=8551  --http.corsdomain "*" \
    --private.api.addr=127.0.0.1:9090 --http --ws --http.api=eth,debug,net,trace,web3,erigon \
    --authrpc.jwtsecret="$JWTSECRET" \
    --torrent.port=42069 --torrent.download.rate=512mb \
    --metrics

The snapshot was downloaded and indexed smoothly, but on execution I got the following warning:

[INFO] [07-05|15:15:36.471] [4/12 Execution] Executed blocks         number=5739334 blk/s=168.3 tx/s=23677.6 Mgas/s=1254.8 gasState=0.22 batch=150.3MB alloc=6.2GB sys=11.2GB
[WARN] [07-05|15:15:58.396] [4/12 Execution] Execution failed        block=5743045 hash=0xe242a6e3b9f015a7e5b9c2e1f23772c9981461a478cad2e4cb4b735f7f8df307 err="invalid block: could not apply tx 136 from block 5743045 [0xe911733c0a0eb71e883cc3f74434d610a3e524bab5086a3ac1cf5d4342861561]: nonce too high: address 0x90622E3Ce5142E69c7549671daDb98da425FE31F, tx: 2 state: 0"
[INFO] [07-05|15:15:59.398] [] Flushed buffer file                   name=erigon-sortable-buf-1375149095
[INFO] [07-05|15:16:00.375] [] Flushed buffer file                   name=erigon-sortable-buf-4294828698
[INFO] [07-05|15:16:00.808] [] Flushed buffer file                   name=erigon-sortable-buf-3756210370
[INFO] [07-05|15:16:30.467] [4/12 Execution] Completed on            block=5743044
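
For context, the failing check is the standard account-nonce rule that every execution client enforces: a transaction is only valid when its nonce equals the sender's current state nonce. A minimal illustrative sketch (not Erigon's actual code; the function name is made up):

```python
# Illustrative sketch of the protocol's nonce rule (not Erigon's code).
# "nonce too high" with tx: 2, state: 0 means the account's earlier
# transactions (nonces 0 and 1) are missing from the executed state,
# which points at corrupted local state rather than a genuinely bad block.

def check_nonce(state_nonce: int, tx_nonce: int) -> str:
    """Return 'ok', 'nonce too high', or 'nonce too low'."""
    if tx_nonce > state_nonce:
        return "nonce too high"
    if tx_nonce < state_nonce:
        return "nonce too low"
    return "ok"

# The situation from the log above: tx nonce 2 against state nonce 0.
print(check_nonce(state_nonce=0, tx_nonce=2))  # -> nonce too high
```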

After following the method described in #4037, I used the following command to re-run the process:

./build/bin/integration stage_headers --reset --datadir=<DATADIR> --chain=mainnet

However, the issue still persists:

[INFO] [07-05|16:02:31.879] [1/12 Snapshots] Requesting downloads 
[INFO] [07-05|16:02:33.353] [snapshots:download] Stat                blocks=19800k indices=19800k alloc=2.8GB sys=5.0GB
[INFO] [07-05|16:02:33.361] [4/12 Execution] Blocks execution        from=5743044 to=19799999
[WARN] [07-05|16:02:33.367] [4/12 Execution] Execution failed        block=5743045 hash=0xe242a6e3b9f015a7e5b9c2e1f23772c9981461a478cad2e4cb4b735f7f8df307 err="invalid block: could not apply tx 136 from block 5743045 [0xe911733c0a0eb71e883cc3f74434d610a3e524bab5086a3ac1cf5d4342861561]: nonce too high: address 0x90622E3Ce5142E69c7549671daDb98da425FE31F, tx: 2 state: 0"
[INFO] [07-05|16:02:33.369] [4/12 Execution] Completed on            block=5743044

The integration tool indicates that I'm in the following state:

[INFO] [07-05|16:59:06.736] [db] open                                label=chaindata sizeLimit=12TB pageSize=8192
[INFO] [07-05|16:59:08.569] [snapshots:all] Stat                     blocks=19800k indices=19800k alloc=2.8GB sys=2.8GB
[INFO] [07-05|16:59:08.570] [snapshots:all] Stat                     blocks=0k indices=0k alloc=2.8GB sys=2.8GB
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.

                    stage_at   prune_at
Snapshots           19799999   0
Headers             19799999   0
BorHeimdall         0          0
BlockHashes         5743044    0
Bodies              19799999   0
Senders             5743044    0
Execution           5743044    5743044
Translation         0          0
HashState           0          0
IntermediateHashes  0          0
AccountHistoryIndex 0          0
StorageHistoryIndex 0          0
LogIndex            0          0
CallTraces          0          0
TxLookup            0          40
Finish              0          0
--
prune distance: 

blocks.v2: true, blocks=19799999, segments=19799999, indices=19799999
blocks.bor.v2: segments=0, indices=0

history.v3: false,  idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)

sequence: EthTx=2397678597, NonCanonicalTx=0

in db: first header 0, last header 0, first body 0, last body 0
--

Any help on how to fix or debug this would be appreciated.

AskAlexSharov commented 2 months ago

try:

integration state_stages --unwind=100
integration stage_headers --unwind=100

start erigon

suxnju commented 2 months ago

@AskAlexSharov The issue is not resolved, and now it has changed to "execution failure on block 3765614" after integration.

[INFO] [07-05|19:12:55.349] P2P                                      app=caplin peers=69
[INFO] [07-05|19:13:19.858] Committed State                          gas reached=429206618002 gasTarget=549755813888 block=3763817 time=32.146131558s committedToDb=true
[INFO] [07-05|19:13:19.859] [4/12 Execution] Executed blocks         number=3763817 blk/s=468.6 tx/s=15005.1 Mgas/s=811.8 gasState=0.00 batch=0B alloc=8.9GB sys=12.7GB
[WARN] [07-05|19:13:21.330] [4/12 Execution] Execution failed        block=3765614 hash=0x6b8c925c8dde7c998f49fffb36e755c9f3e26080d7edc807f0d5e612aeb057af err="invalid block: could not apply tx 19 from block 3765614 [0x3aa8a59e6b21dc52004e9a9447187df0174a96ca8813fd26161557170d45b32f]: nonce too high: address 0x906227ba18dC0C6ed3325E510bef5Eb88598bC4D, tx: 3 state: 0"
[INFO] [07-05|19:13:22.719] [4/12 Execution] Completed on            block=3765613
[INFO] [07-05|19:13:22.720] [4/12 Execution] DONE                    in=4m54.829447868s

In fact, I encountered issues while running the integration command as well:

[INFO] [07-05|17:26:40.389] [9/15 IntermediateHashes] Calculating Merkle root current key=ef6bba95
[INFO] [07-05|17:26:46.268] [9/15 IntermediateHashes] Regeneration ended 
[EROR] [07-05|17:26:46.269] [9/15 IntermediateHashes] Wrong trie root of block 5743044: a804b52a9fd095da4a45448530cf0eb5e543340ca1533ae7028e913153e4e882, expected (from header): ea7f81807ee115e79348ca42be67658752cbfd60a355bcf5d47e3aaa19dfa401. Block hash: 6a94d37476d717bd88e71792c768f0492060049a6f4a64f56dc5b1346cc589eb 
[WARN] [07-05|17:26:46.269] Unwinding due to incorrect root hash     to=2871522
[INFO] [07-05|17:26:46.269] [9/15 IntermediateHashes] DONE           in=2m5.884770842s
[INFO] [07-05|17:26:46.269] [8/15 HashState] Unwinding started       from=5743044 to=2871522 storage=false codes=true
...

Here is the complete integration log: integration .log

Everything else runs normally. Should I run the integration command again?

suxnju commented 2 months ago
[INFO] [07-05|19:47:52.748] logging to file system                   log dir=/mnt/data/erigon_sync/data/erigon_full/logs file prefix=integration log level=info json=false
[INFO] [07-05|19:47:52.755] [db] open                                label=chaindata sizeLimit=12TB pageSize=8192
[INFO] [07-05|19:47:54.565] [snapshots:all] Stat                     blocks=19800k indices=19800k alloc=2.9GB sys=3.0GB
[INFO] [07-05|19:47:54.566] [snapshots:all] Stat                     blocks=0k indices=0k alloc=2.9GB sys=3.0GB
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.

                 stage_at    prune_at
Snapshots            19799999    0
Headers              19799999    0
BorHeimdall              0       0
BlockHashes              3765613     0
Bodies               19799999    0
Senders              3765613     0
Execution            3765613     3765613
Translation              0       0
HashState            2871522     0
IntermediateHashes       0       0
AccountHistoryIndex          0       0
StorageHistoryIndex          0       0
LogIndex             0       0
CallTraces           0       0
TxLookup             0       50
Finish               0       0
--
prune distance: 

blocks.v2: true, blocks=19799999, segments=19799999, indices=19799999
blocks.bor.v2: segments=0, indices=0

history.v3: false,  idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)

sequence: EthTx=2397678597, NonCanonicalTx=0

in db: first header 0, last header 0, first body 0, last body 0
--
AskAlexSharov commented 2 months ago

My advice: rm -rf chaindata and restart erigon. I don't understand why "Stage Senders" is below Snapshots (snapshots include this data).

suxnju commented 2 months ago

This is not my first time using Erigon to sync mainnet.

So, I decided to re-sync using a new, empty datadir. Is it possible that there might be some shared files between the two syncs?

suxnju commented 2 months ago

I have removed the chaindata, and the output of print_stages is

[INFO] [07-05|20:07:03.741] logging to file system                   log dir=/mnt/data/erigon_sync/data/erigon_full/logs file prefix=integration log level=info json=false
[INFO] [07-05|20:07:03.741] [db] open                                label=chaindata sizeLimit=12TB pageSize=8192
[INFO] [07-05|20:07:05.768] [snapshots:all] Stat                     blocks=19800k indices=19800k alloc=2.6GB sys=3.0GB
[INFO] [07-05|20:07:05.769] [snapshots:all] Stat                     blocks=0k indices=0k alloc=2.6GB sys=3.0GB
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.

                 stage_at    prune_at
Snapshots            19799999    0
Headers              19799999    0
BorHeimdall              0       0
BlockHashes              19799999    0
Bodies               19799999    0
Senders              19799999    0
Execution            0       0
Translation              0       0
HashState            0       0
IntermediateHashes       0       0
AccountHistoryIndex          0       0
StorageHistoryIndex          0       0
LogIndex             0       0
CallTraces           0       0
TxLookup             0       0
Finish               0       0
--
prune distance: 

blocks.v2: true, blocks=19799999, segments=19799999, indices=19799999
blocks.bor.v2: segments=0, indices=0

history.v3: false,  idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)

sequence: EthTx=2397678597, NonCanonicalTx=0

in db: first header 0, last header 0, first body 0, last body 0
--
AskAlexSharov commented 2 months ago

Yes, looks good. Just run.

suxnju commented 2 months ago

Yes, it appears to be fixed! Many thanks!

suxnju commented 1 month ago

@AskAlexSharov Sorry to bother you again, but I am still encountering the Execution failed issue during synchronization.

For example at block 10684421:

[WARN] [07-07|19:47:49.484] [4/12 Execution] Execution failed        block=10684421 hash=0x2984d3c8054875f80fd4bcb38d67aa61926875b8898abd4de95775a9d58e17d1 err="invalid block: mismatched receipt headers for block 10684421 (0x05c14082a5805af786cc2d05d967cd5a6bf9d2f3c572a27829fba3c29979e6df != 0x6efd8fb91240344aee78c79995ed0c48910a175cc81ba5f197baa2774be649d8)"
[INFO] [07-07|19:47:49.484] [4/12 Execution] Completed on            block=10684420
[WARN] [07-07|19:47:49.484] bad forkchoice                           head=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3 hash=0xb818e5a3596e0f2589baeb0249051f5626421e90f5c94a18d15eb224108b9509
[INFO] [07-09|02:42:33.298] [4/12 Execution] Blocks execution        from=10311398 to=20263599

Then

Step 1: Continue using Unwind

integration state_stages --unwind=100 \
    --datadir "$DATADIR" \
    --chain mainnet

integration stage_headers --unwind=100 \
    --datadir "$DATADIR" \
    --chain mainnet

It took approximately 12 hours to execute:

[EROR] [07-08|01:23:31.142] [9/15 IntermediateHashes] Wrong trie root of block 10684420: 44f1e67144899d1dab3e58610c8fe03a4f1832af38ce9ef9ece3ce50b04f1a56, expected (from header): e32869f3aeb4a0bfe6b7f6335d94a5e4df62218fb60cca92119caf0958a818e5. Block hash: 6e6a4427444a6eb1cb7b304e5f9d3c937877c6c4241bbe98c508934448c4e5ab 
[WARN] [07-08|01:23:31.143] Unwinding due to incorrect root hash     to=5342210
[INFO] [07-08|01:23:31.144] [9/15 IntermediateHashes] DONE           in=9m59.859359969s
[INFO] [07-08|01:23:31.144] [8/15 HashState] Unwinding started       from=10684420 to=5342210 storage=false codes=true
...
[INFO] [07-08|05:15:06.616] [8/15 HashState] Unwind done             in=3h51m35.47232704s
...
[INFO] [07-08|10:58:16.136] [7/15 Execution] Unwind done             in=5h43m9.518812562s
...
[INFO] [07-08|01:13:31.283] [8/15 HashState] DONE                    in=4h53m59.846678007s

Step 2: Resynchronize. The Execution failed issue occurs again, now at block 10311399:

[WARN] [07-09|02:42:33.312] [4/12 Execution] Execution failed        block=10311399 hash=0xc79e932d1f9debcefe68fa2ee8bb0777afe600dfcbd4ba84f5442633c6cfaf0f err="invalid block: could not apply tx 109 from block 10311399 [0x47660aad062d1c9be1968cf45c08dfe8d3c69d2bf9deac33c4140f4371ed297f]: gas limit reached"
[INFO] [07-09|02:42:33.312] [4/12 Execution] Completed on            block=10311398
[WARN] [07-09|02:42:33.313] bad forkchoice                           head=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3 hash=0x4a0f9715bf89b0974c02a3f86fb2e6def7c225b50d4e42618a3c6d05af4e36a7

The current status is as follows:

[INFO] [07-09|12:41:48.634] [db] open                                label=chaindata sizeLimit=12TB pageSize=8192
[INFO] [07-09|12:41:52.885] [snapshots:all] Stat                     blocks=19800k indices=19800k alloc=2.9GB sys=3.0GB
[INFO] [07-09|12:41:52.893] [snapshots:all] Stat                     blocks=0k indices=0k alloc=2.9GB sys=3.0GB
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.

                 stage_at    prune_at
Snapshots            19799999    0
Headers              19799999    0
BorHeimdall              0       0
BlockHashes              10311398    0
Bodies               19799999    0
Senders              10311398    0
Execution            10311398    10311398
Translation              0       0
HashState            5342210     0
IntermediateHashes       0       0
AccountHistoryIndex          0       0
StorageHistoryIndex          0       0
LogIndex             0       0
CallTraces           0       0
TxLookup             0       30
Finish               0       0
--
prune distance: 

blocks.v2: true, blocks=19799999, segments=19799999, indices=19799999
blocks.bor.v2: segments=0, indices=0

history.v3: false,  idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)

sequence: EthTx=2474175239, NonCanonicalTx=0

in db: first header 19800000, last header 20266618, first body 19800000, last body 20266618
--

If I encounter a similar issue again, should I continue to repeat the Step 1 and Step 2 strategies?

suxnju commented 1 month ago

@AskAlexSharov Sorry to bother you again. I'm encountering a SIGSEGV: segmentation violation error when using the integration command. Similar to before, the commands I used are as follows:

integration state_stages --unwind=100 \
    --datadir "$DATADIR" \
    --chain mainnet

integration stage_headers --unwind=100 \
    --datadir "$DATADIR" \
    --chain mainnet

Detailed error log: integration .log

Current stages:

[INFO] [07-15|15:55:58.571] logging to file system                   log dir=/mnt/data/erigon_sync/data/erigon_full/logs file prefix=integration log level=info json=false
[INFO] [07-15|15:55:58.572] [db] open                                label=chaindata sizeLimit=12TB pageSize=8192
[INFO] [07-15|15:56:00.374] [snapshots:all] Stat                     blocks=19800k indices=19800k alloc=2.8GB sys=2.8GB
[INFO] [07-15|15:56:00.375] [snapshots:all] Stat                     blocks=0k indices=0k alloc=2.8GB sys=2.8GB
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.

                 stage_at    prune_at
Snapshots            19799999    0
Headers              19799899    0
BorHeimdall              0       0
BlockHashes              10311398    0
Bodies               19799899    0
Senders              10311398    0
Execution            10311398    10311398
Translation              0       0
HashState            5342210     0
IntermediateHashes       0       0
AccountHistoryIndex          0       0
StorageHistoryIndex          0       0
LogIndex             0       0
CallTraces           0       0
TxLookup             0       40
Finish               0       0
--
prune distance: 

blocks.v2: true, blocks=19799999, segments=19799999, indices=19799999
blocks.bor.v2: segments=0, indices=0

history.v3: false,  idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)

sequence: EthTx=2397678597, NonCanonicalTx=0

in db: first header 0, last header 0, first body 0, last body 0
--
AskAlexSharov commented 1 month ago

@suxnju no space left on disk?

suxnju commented 1 month ago

@AskAlexSharov Here is the output of my SSD space usage. It seems like there is still 19T available, but I am using multiple SSDs merged into one. Could this be causing the issue?

$ df -h
Filesystem                    Size  Used  Avail Use% Mounted on
/dev/mapper/moss_db-db        3.6T  3.0T  473G  87% /mnt/db
/dev/mapper/moss_data-data     25T  5.6T   19T  24% /mnt/data
AskAlexSharov commented 1 month ago

try https://github.com/ledgerwatch/erigon/issues/10814#issuecomment-2194139697

suxnju commented 1 month ago

@AskAlexSharov Here are some outputs, which also detected errors, though different from those in #10814.

$ du -h /mnt/data/erigon_sync/data/erigon_full
1.3M    /mnt/data/erigon_sync/data/erigon_full/diagnostics
3.6M    /mnt/data/erigon_sync/data/erigon_full/nodes/eth68
4.0M    /mnt/data/erigon_sync/data/erigon_full/nodes/eth67
7.5M    /mnt/data/erigon_sync/data/erigon_full/nodes
531G    /mnt/data/erigon_sync/data/erigon_full/chaindata
19M /mnt/data/erigon_sync/data/erigon_full/downloader
4.6M    /mnt/data/erigon_sync/data/erigon_full/logs
8.6G    /mnt/data/erigon_sync/data/erigon_full/caplin/indexing/beacon_indicies
8.6G    /mnt/data/erigon_sync/data/erigon_full/caplin/indexing
1.3M    /mnt/data/erigon_sync/data/erigon_full/caplin/blobs/chaindata
998M    /mnt/data/erigon_sync/data/erigon_full/caplin/blobs/944
1.1G    /mnt/data/erigon_sync/data/erigon_full/caplin/blobs/947
1.2G    /mnt/data/erigon_sync/data/erigon_full/caplin/blobs/945
1.4G    /mnt/data/erigon_sync/data/erigon_full/caplin/blobs/946
4.5G    /mnt/data/erigon_sync/data/erigon_full/caplin/blobs
14G /mnt/data/erigon_sync/data/erigon_full/caplin
56M /mnt/data/erigon_sync/data/erigon_full/txpool
96M /mnt/data/erigon_sync/data/erigon_full/temp/caplin-forkchoice
1.2M    /mnt/data/erigon_sync/data/erigon_full/temp/erigon-memdb-3307046689
1.1G    /mnt/data/erigon_sync/data/erigon_full/temp
4.0K    /mnt/data/erigon_sync/data/erigon_full/snapshots/accessor
4.0K    /mnt/data/erigon_sync/data/erigon_full/snapshots/history
4.0K    /mnt/data/erigon_sync/data/erigon_full/snapshots/domain
4.0K    /mnt/data/erigon_sync/data/erigon_full/snapshots/idx
586G    /mnt/data/erigon_sync/data/erigon_full/snapshots
1.2T    /mnt/data/erigon_sync/data/erigon_full
$ ./execution/erigon/build/bin/mdbx_stat -ef /mnt/data/erigon_sync/data/erigon_full/chaindata/
mdbx_stat v0.12.0-71-g1cac6536 (2022-07-28T09:57:31+07:00, T-9a6d7e5b917e5fbd14dc51835fa749d092aa1d72)
Running for /mnt/data/erigon_sync/data/erigon_full/chaindata/...
Environment Info
  Pagesize: 8192
  Dynamic datafile: 24576..13194139533312 bytes (+16777216/-33554432), 3..1610612736 pages (+2048/-4096)
  Current mapsize: 13194139533312 bytes, 1610612736 pages 
  Current datafile: 569385156608 bytes, 69505024 pages
  Last transaction ID: 239163
  Latter reader transaction ID: 239163 (0)
  Max readers: 116
  Number of reader slots uses: 1
Garbage Collection
  Pagesize: 8192
  Tree depth: 2
  Branch pages: 1
  Leaf pages: 115
  Overflow pages: 6992
  Entries: 10457
Page Usage
  Total: 1610612736 100%
  Backed: 69505024 4.3%
  Allocated: 69503906 4.3%
  Remained: 1541108830 95.7%
  Used: 55064839 3.4%
  GC: 14439067 0.9%
  Retained: 17 0.0%
  Reclaimable: 14439050 0.9%
  Available: 1555547880 96.6%
Status of Main DB
  Pagesize: 8192
  Tree depth: 2
  Branch pages: 1
  Leaf pages: 2
  Overflow pages: 0
  Entries: 143
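
For what it's worth, the mdbx_stat sizes above are internally consistent (datafile size = backed pages × pagesize, mapsize = total pages × pagesize), so the environment header itself looks sane; a quick arithmetic cross-check of the numbers as printed:

```python
# Cross-check of the mdbx_stat output above: byte sizes are page counts
# multiplied by the 8 KiB pagesize.
PAGESIZE = 8192

backed_pages = 69_505_024       # "Backed" / "Current datafile" pages
total_pages = 1_610_612_736     # "Current mapsize" pages (the 12 TiB limit)

assert backed_pages * PAGESIZE == 569_385_156_608      # Current datafile bytes
assert total_pages * PAGESIZE == 13_194_139_533_312    # Current mapsize bytes
print("mdbx_stat page accounting checks out")
```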
$ ./execution/erigon/build/bin/mdbx_chk -0 -d /mnt/data/erigon_sync/data/erigon_full/chaindata/
mdbx_chk v0.12.0-71-g1cac6536 (2022-07-28T09:57:31+07:00, T-9a6d7e5b917e5fbd14dc51835fa749d092aa1d72)
Running for /mnt/data/erigon_sync/data/erigon_full/chaindata/ in 'read-only' mode...
Iterating DBIs...
 - problems: wrong order of entries (1)
 - problems: wrong order of entries (2)
 - interrupted by signal
Total 3 errors are detected, elapsed 7478.411 seconds.
$ ./execution/erigon/build/bin/mdbx_chk -1 -d /mnt/data/erigon_sync/data/erigon_full/chaindata/
mdbx_chk v0.12.0-71-g1cac6536 (2022-07-28T09:57:31+07:00, T-9a6d7e5b917e5fbd14dc51835fa749d092aa1d72)
Running for /mnt/data/erigon_sync/data/erigon_full/chaindata/ in 'read-only' mode...
Iterating DBIs...
 - problems: wrong order of entries (1)
 - problems: wrong order of entries (2)
Total 3 errors are detected, elapsed 11149.777 seconds.
$ ./execution/erigon/build/bin/mdbx_chk -2 -d /mnt/data/erigon_sync/data/erigon_full/chaindata/
mdbx_chk v0.12.0-71-g1cac6536 (2022-07-28T09:57:31+07:00, T-9a6d7e5b917e5fbd14dc51835fa749d092aa1d72)
Running for /mnt/data/erigon_sync/data/erigon_full/chaindata/ in 'read-only' mode...
Iterating DBIs...
 - problems: wrong order of entries (1)
 - problems: wrong order of entries (2)
 - problems: wrong order of multi-values (1)
Total 4 errors are detected, elapsed 11307.339 seconds.
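
On the meaning of these mdbx_chk errors: mdbx keeps the keys within each B-tree page in sorted byte order, so "wrong order of entries" means two adjacent keys compare out of order — the signature of silent data corruption (e.g. bit flips) rather than a logic bug. An illustrative check of that invariant, not mdbx's actual code:

```python
# Illustrative version of the invariant mdbx_chk verifies (not mdbx's code):
# keys inside a page must be in strictly ascending byte order.

def find_order_violations(keys: list[bytes]) -> list[int]:
    """Return indices i where keys[i] >= keys[i+1] (ordering violated)."""
    return [i for i in range(len(keys) - 1) if keys[i] >= keys[i + 1]]

good = [b"\x00\x01", b"\x00\x02", b"\x10\x00"]
# A single flipped high bit in the middle key breaks the ordering:
corrupt = [b"\x00\x01", b"\x80\x02", b"\x10\x00"]

print(find_order_violations(good))     # -> []
print(find_order_violations(corrupt))  # -> [1]
```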
AskAlexSharov commented 1 month ago

Looks like a hardware failure.

suxnju commented 1 month ago

@AskAlexSharov Hello, Alex. Just to let you know, based on your suggestions I ran memtest86 and smartmontools and indeed detected some issues. I have contacted the manufacturer for a repair and will test further afterward. Thank you again.

suxnju commented 2 weeks ago

Hello, Alex @AskAlexSharov. Sorry to bother you again. I've resolved the RAM issue. However, the process gets stuck at block 19,799,999, and after automatically unwinding, the synchronization hits the same issues.

The first occurrence is as follows:

[INFO] [07-31|12:53:21.352] [4/12 Execution] Blocks execution        from=19700396 to=20424194
[WARN] [07-31|12:53:22.234] [4/12 Execution] Execution failed        block=19700397 hash=0xf82213432d66bb86592b5ff6d5905372bc705956699579b7d101e3a45e543631 err="invalid block: could not apply tx 153 from block 19700397 [0x9e43c239b24cb980ab178142cb9a8c2079e753d7dcc5ac1cb6548a30369ffc0b]: gas limit reached"
[INFO] [07-31|12:53:22.234] [4/12 Execution] Completed on            block=19700396
[WARN] [07-31|12:53:22.243] bad forkchoice                           head=0xd4e56740f876aef8c010b86a40d5f56745a118d0906a34e69aec8c0db1cb8fa3 hash=0x283517b69e93b823e0f56d75be412293b5f3b614ad863c4e3e937a3a98348f10
[INFO] [07-31|12:53:29.183] [Caplin] Forward Sync                    app=caplin stage=ForwardSync progress=9631395 distance-from-chain-tip=5m48s estimated-time-remaining=1m27s

After running [unwind](https://github.com/erigontech/erigon/issues/11044#issuecomment-2227905545), it went back to block 9,850,198 (unwinding nearly 10 million blocks).

[EROR] [07-31|15:09:51.924] [9/15 IntermediateHashes] Wrong trie root of block 19700396: 7f69e7484a20d616ef406d93fa7fae76011cb0da6807c2ef57a287b416502041, expected (from header): 31c93f813d0e301fe17b1e4e912d9efaafdf8bb42f282b0240c275ac6f176d30. Block hash: 2343387f198eb232aea0eb17690182938e9577127d59e615c269a36280f1bd05 
[WARN] [07-31|15:09:51.924] Unwinding due to incorrect root hash     to=9850198
[INFO] [07-31|15:09:51.925] [9/15 IntermediateHashes] DONE           in=38m26.533014512s
[INFO] [07-31|15:09:51.926] [8/15 HashState] Unwinding started       from=19700396 to=9850198 storage=false codes=true
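
One pattern worth noting: every "Unwinding due to incorrect root hash" event in this thread targets exactly half of the failing block number (5743044→2871522, 10684420→5342210, 19700396→9850198), which looks like a binary-search-style fallback to find the last good state. This is an observation from the logs, not confirmed against Erigon's source:

```python
# Observation from the logs in this thread (not confirmed against Erigon's
# source): each wrong-trie-root unwind targets half of the failing block.
events = [
    (5_743_044, 2_871_522),    # first failure
    (10_684_420, 5_342_210),   # second failure
    (19_700_396, 9_850_198),   # this failure
]
for failed_at, unwound_to in events:
    assert failed_at // 2 == unwound_to
print("every unwind target is failing_block // 2")
```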

When I started again, numerous bad blocks appeared,

[WARN] [08-19|04:14:10.333] bad blocks segment received              err="replay block, code: 2"

and the process stopped executing blocks after reaching 19,799,999.

Here is the full log: erigon_ex.log

This is the result of print_stages:

[INFO] [08-19|13:10:17.047] logging to file system                   log dir=/mnt/data/erigon_sync/data/erigon_full_3/logs file prefix=integration log level=info json=false
[INFO] [08-19|13:10:17.056] [db] open                                label=chaindata sizeLimit=12TB pageSize=8192
[INFO] [08-19|13:10:19.658] [snapshots:all] Stat                     blocks=19800k indices=19800k alloc=2.7GB sys=2.8GB
[INFO] [08-19|13:10:19.660] [snapshots:all] Stat                     blocks=0k indices=0k alloc=2.7GB sys=2.8GB
Note: prune_at doesn't mean 'all data before were deleted' - it just mean stage.Prune function were run to this block. Because 1 stage may prune multiple data types to different prune distance.

                 stage_at    prune_at
Snapshots            19799999    0
Headers              19799999    0
BorHeimdall              0       0
BlockHashes              19799999    0
Bodies               19799999    0
Senders              19799999    0
Execution            19799999    9899999
Translation              0       0
HashState            9899999     0
IntermediateHashes       0       0
AccountHistoryIndex          0       0
StorageHistoryIndex          0       0
LogIndex             0       0
CallTraces           0       0
TxLookup             0       20
Finish               0       0
--
prune distance: 

blocks.v2: true, blocks=19799999, segments=19799999, indices=19799999
blocks.bor.v2: segments=0, indices=0

history.v3: false,  idx steps: 0.00, lastBlockInSnap=0, TxNums_Index(0,1)

sequence: EthTx=2506244563, NonCanonicalTx=0

in db: first header 19800000, last header 20466745, first body 19800000, last body 20466745
--

And this is the snapshot from the dashboard: snapshot

somnathb1 commented 1 week ago

May I suggest restarting with the latest release - https://github.com/erigontech/erigon/releases/tag/v2.60.6

AskAlexSharov commented 1 week ago

you can try:

integration stage_hash_state --reset
integration stage_hash_state
integration stage_trie

also it feels like you run erigon on one datadir, but integration on another datadir:

Blocks execution        from=19700396 to=20424194
Execution            19799999 

also please switch to the latest erigon version. Thanks.