NethermindEth / nethermind

A robust execution client for Ethereum node operators.
https://nethermind.io/nethermind-client
GNU General Public License v3.0

Unable to load a 1GB genesis file in 40 seconds in version 1.28.0. #7361

Open lyfsn opened 1 month ago

lyfsn commented 1 month ago

Description

Our custom network uses a large 1GB genesis.json file, and it worked fine with versions before 1.28.0, such as 1.27.x.

However, after upgrading to version 1.28.0, my Nethermind node fails to start with this error:

26 Aug 02:44:18 | Snap serving enabled, but PruningBoundary is less than 128. Setting to 128. 
26 Aug 02:45:39 | Step LoadGenesisBlock         failed after 80976ms System.TimeoutException: Genesis block was not processed after 40 seconds
   at Nethermind.Init.Steps.LoadGenesisBlock.Load(IWorldState worldState) in /src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs:line 88
   at Nethermind.Init.Steps.LoadGenesisBlock.Execute(CancellationToken _) in /src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs:line 46
   at Nethermind.Init.Steps.EthereumStepsManager.ExecuteStep(IStep step, StepInfo stepInfo, CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 153
   at Nethermind.Init.Steps.EthereumStepsManager.InitializeAll(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 95
   at Nethermind.Runner.Ethereum.EthereumRunner.Start(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Runner/Ethereum/EthereumRunner.cs:line 36
   at Nethermind.Runner.Program.<>c__DisplayClass8_0.<<Run>b__1>d.MoveNext() in /src/Nethermind/Nethermind.Runner/Program.cs:line 213
26 Aug 02:45:39 | Error during ethereum runner start System.TimeoutException: Genesis block was not processed after 40 seconds
   at Nethermind.Init.Steps.LoadGenesisBlock.Load(IWorldState worldState) in /src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs:line 88
   at Nethermind.Init.Steps.LoadGenesisBlock.Execute(CancellationToken _) in /src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs:line 46
   at Nethermind.Init.Steps.EthereumStepsManager.ExecuteStep(IStep step, StepInfo stepInfo, CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 153
   at Nethermind.Init.Steps.EthereumStepsManager.InitializeAll(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Init/Steps/EthereumStepsManager.cs:line 95
   at Nethermind.Runner.Ethereum.EthereumRunner.Start(CancellationToken cancellationToken) in /src/Nethermind/Nethermind.Runner/Ethereum/EthereumRunner.cs:line 36
   at Nethermind.Runner.Program.<>c__DisplayClass8_0.<<Run>b__1>d.MoveNext() in /src/Nethermind/Nethermind.Runner/Program.cs:line 213

Steps to Reproduce

  1. Generate a large genesis file of 1GB.
  2. Use this large genesis file to initialize and start the node.
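
The generation step above can be sketched as follows. This is a minimal illustration, not the script used in this report: the function name `write_large_chainspec`, the `name` value, and the balance are hypothetical, and it assumes a Parity-style chainspec where preallocated accounts sit under an `accounts` key. It writes only that section; a usable chainspec also needs the `engine`, `params`, and `genesis` sections of your network.

```python
import json
import os
import secrets


def write_large_chainspec(path, num_accounts, balance_hex="0x1"):
    """Stream a chainspec-like JSON file holding `num_accounts` random accounts.

    Entries are written one at a time so the whole document never has to
    fit in memory. Returns the resulting file size in bytes.
    """
    with open(path, "w", encoding="utf-8") as f:
        # Hypothetical skeleton: only the "accounts" section is filled in.
        f.write('{"name": "LargeGenesisTest", "accounts": {')
        for i in range(num_accounts):
            address = "0x" + secrets.token_hex(20)  # random 20-byte address
            entry = json.dumps({"balance": balance_hex})
            separator = "," if i else ""  # no comma before the first entry
            f.write(f'{separator}"{address}": {entry}')
        f.write("}}")
    return os.path.getsize(path)
```

With the default balance each entry takes about 65 bytes, so a file near 1GB needs on the order of 16 million accounts, for example `write_large_chainspec("chainspec.json", 16_000_000)`.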

Actual behavior

The node can't start and logs a timeout of 40 seconds.

By the way, why is the 40s timeout hardcoded? https://github.com/NethermindEth/nethermind/blob/e856de5a33259ea0e54c40c28db37631bf56c2c0/src/Nethermind/Nethermind.Init/Steps/LoadGenesisBlock.cs#L23

Expected behavior

The node can start normally, just like in version 1.27.x.

Additional context

In more precise testing, I found that if the genesis file size exceeds 256MB, the node fails to start and times out while loading the genesis file.

My startup parameters:

version: "3.9"
services:
  execution:
    tty: true
    environment:
    - TERM=xterm-256color
    - COLORTERM=truecolor
    stop_grace_period: 30s
    container_name: gas-execution-client
    image: ${EC_IMAGE_VERSION}
    networks:
    - gas
    volumes:
    - ${EC_DATA_DIR}:/nethermind/data
    - ${EC_JWT_SECRET_PATH}:/tmp/jwt/jwtsecret
    - ${CHAINSPEC_PATH}:/tmp/chainspec/chainspec.json
    ports:
    - "30304:30304/tcp"
    - "30304:30304/udp"
    - "8009:8009"
    - "8545:8545"
    - "8551:8551"
    expose:
    - 8545
    - 8551
    command:
    - --config=none.cfg
    - --Init.ChainSpecPath=/tmp/chainspec/chainspec.json
    - --datadir=/nethermind/data
    - --log=INFO
    - --JsonRpc.Enabled=true
    - --JsonRpc.Host=0.0.0.0
    - --JsonRpc.Port=8545
    - --JsonRpc.JwtSecretFile=/tmp/jwt/jwtsecret
    - --JsonRpc.EngineHost=0.0.0.0
    - --JsonRpc.EnginePort=8551
    - --Network.DiscoveryPort=30304
    - --HealthChecks.Enabled=true
    - --Metrics.Enabled=true
    - --Metrics.ExposePort=8009
    - --Sync.MaxAttemptsToUpdatePivot=0
    logging:
      driver: json-file
      options:
        max-size: 10m
        max-file: "10"
networks:
  gas:
    name: gas-network

LukaszRozmej commented 1 month ago

Can you share the genesis file you are using?

lyfsn commented 1 month ago

> Can you share the genesis file you are using?

In my test environment, I generate a random genesis file every time using this script, which creates many accounts in one genesis file.

For a quick test, here is a genesis file larger than 800MB from Endurance's mainnet that you could also try: https://github.com/OpenFusionist/network_config (I haven't checked whether this file reproduces the error. My error comes from the script-generated file mentioned above.)

ohko4711 commented 2 weeks ago

Hi @LukaszRozmej, I investigated the performance regression mentioned above and have some conclusions and points I'd like to discuss further.

Regarding the performance issue: PR https://github.com/NethermindEth/nethermind/pull/7215 was a performance optimization that replaced LruCache with ClockCache to reduce lock granularity. However, due to implementation details, it caused a regression that led to timeouts when initializing large genesis files (>800MB). Based on our tests, the latest commit (60159fb448d5b7fd53565aa7b15942a8c68614ba) appears to have fixed this issue.
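
For readers unfamiliar with the two cache types, here is a minimal single-threaded sketch of the textbook CLOCK (second-chance) eviction policy. It is not Nethermind's ClockCache, and the names are illustrative only. The point it shows is that a hit only sets a reference bit instead of reordering a recency list as LRU does, which is why a CLOCK design can get by with less locking.

```python
_EMPTY = object()  # marks a slot that has never held a key


class ClockCache:
    """Fixed-capacity cache using the CLOCK (second-chance) eviction policy."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.keys = [_EMPTY] * capacity       # slot -> key
        self.referenced = [False] * capacity  # slot -> "recently used" bit
        self.slot_of = {}                     # key -> slot
        self.values = {}                      # key -> value
        self.hand = 0                         # next slot the hand inspects

    def get(self, key, default=None):
        slot = self.slot_of.get(key)
        if slot is None:
            return default
        self.referenced[slot] = True  # a hit only sets a bit, no reordering
        return self.values[key]

    def set(self, key, value):
        slot = self.slot_of.get(key)
        if slot is not None:  # key already cached: update in place
            self.values[key] = value
            self.referenced[slot] = True
            return
        slot = self._find_victim()
        old_key = self.keys[slot]
        if old_key is not _EMPTY:  # evict the previous occupant
            del self.slot_of[old_key]
            del self.values[old_key]
        self.keys[slot] = key
        self.referenced[slot] = False
        self.slot_of[key] = slot
        self.values[key] = value

    def _find_victim(self):
        # Sweep until a slot is empty or has its reference bit cleared.
        while True:
            slot = self.hand
            self.hand = (self.hand + 1) % self.capacity
            if self.keys[slot] is _EMPTY or not self.referenced[slot]:
                return slot
            self.referenced[slot] = False  # second chance: clear and move on
```

With a capacity of 2, inserting `a` and `b`, reading `a`, and then inserting `c` evicts `b`, because `a` had its reference bit set and was given a second chance.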

Issue identification method:

Regarding the hard-coded 40s timeout: this has been discussed previously. Related PR: https://github.com/NethermindEth/nethermind/pull/6160. We can discuss this further:

It's up for discussion if we want to increase the timeout from 40 seconds (current default, hard-coded value) to something different.

Let me know if you need any additional information or clarification on this matter.

LukaszRozmej commented 2 weeks ago

@ohko4711 thank you for the analysis. #7215 might have had some unplanned effect, though https://github.com/NethermindEth/nethermind/commit/60159fb448d5b7fd53565aa7b15942a8c68614ba shouldn't affect genesis based on the code, so I'm not sure that it is what fixed it. @benaadams can you check? Both are your changes.

I will move the timeout to config though.
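
As an illustration of what moving the timeout to config means, here is a minimal Python sketch of the pattern. It is a stand-in, not Nethermind's C# code: `wait_for_genesis` and `timeout_ms` are hypothetical names, and the 40000 ms default only mirrors the hard-coded value discussed in this thread.

```python
import threading


def wait_for_genesis(processed: threading.Event, timeout_ms: int = 40_000) -> None:
    """Block until `processed` is set, or raise TimeoutError after `timeout_ms`.

    `timeout_ms` stands in for a value read from configuration instead of
    a constant compiled into the startup step.
    """
    if not processed.wait(timeout_ms / 1000):
        raise TimeoutError(
            f"Genesis block was not processed after {timeout_ms / 1000:g} seconds"
        )
```

A network with a very large genesis file could then pass a larger value, for example `wait_for_genesis(event, timeout_ms=300_000)`, while other setups keep the default.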