ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0
1.2k stars 300 forks source link

Lodestar CL < > Nimbus VC incompatibility #6634

Closed barnabasbusa closed 5 months ago

barnabasbusa commented 8 months ago

Describe the bug

We are in the process to test cross beacon <> validator client compatibility, and found a bug when testing lodestar CL with Nimbus VC.

Nimbus reports:

INF 2024-04-05 14:24:30.001+00:00 Slot start                                 slot=26 epoch=0 attestationIn=16s blockIn= validators=64 node_status=synced delay=1ms448us381ns
ERR 2024-04-05 14:24:30.072+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV3(best);unexpected-data" node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
WRN 2024-04-05 14:24:30.072+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV3(best);unexpected-data" wall_slot=26 validator_index=85 service=block_service slot=26 validator=b0b263c2@85
NTC 2024-04-05 14:24:30.072+00:00 Block production using V3 failed, trying V2 wall_slot=26 validator_index=85 service=block_service slot=26 validator=b0b263c2@85
NTC 2024-04-05 14:24:30.863+00:00 Beacon node is online                      agent_version=Lodestar/v1.17.0/898cd90 node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:30.866+00:00 Beacon node is compatible                  node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:30.868+00:00 Beacon node is in sync                     head_slot=23 sync_distance=0 is_optimistic=false node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
ERR 2024-04-05 14:24:30.882+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV2(best);unexpected-data" node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
WRN 2024-04-05 14:24:30.882+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV2(best);unexpected-data" wall_slot=26 service=block_service slot=26 validator=b0b263c2@85
NTC 2024-04-05 14:24:32.873+00:00 Beacon node is online                      agent_version=Lodestar/v1.17.0/898cd90 node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:32.878+00:00 Beacon node is compatible                  node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:32.881+00:00 Beacon node is in sync                     head_slot=23 sync_distance=0 is_optimistic=false node=http://172.16.0.31:8562[Lodestar/v1.17.0/898cd90] node_index=0 node_roles=AGBSDT
NTC 2024-04-05 14:24:34.231+00:00 Sync committee message published           validator_index=98 delay=231ms206us633ns duration=217ms69us375ns message="(slot: 26, beacon_block_root: \"581b9a2f\", validator_index: 98, signature: \"814ebb83\")" service=sync_committee_service slot=26 validator=a9fdcd17@98 block_root=581b9a2f

Lodestar reports:

Apr-05 14:20:00.472[rest]            error: Req req-r produceBlockV2 error - REGEN_ERROR_SLOT_BEFORE_BLOCK_SLOT
Error: REGEN_ERROR_SLOT_BEFORE_BLOCK_SLOT
    at StateRegenerator.getBlockSlotState (file:///usr/app/packages/beacon-node/src/chain/regen/regen.ts:114:13)
    at JobItemQueue.QueuedStateRegenerator.jobQueueProcessor [as itemProcessor] (file:///usr/app/packages/beacon-node/src/chain/regen/queued.ts:288:35)
    at Timeout.JobItemQueue.runJob [as _onTimeout] (file:///usr/app/packages/beacon-node/src/util/queue/itemQueue.ts:101:33)
    at listOnTimeout (node:internal/timers:573:17)
    at processTimers (node:internal/timers:514:7)
Apr-05 14:20:12.000[]                 info: Synced - slot: 4 - head: 0x9cc3…383b - exec-block: valid(2 0xf3b5…) - finalized: 0x0000…0000:0 - peers: 5
Apr-05 14:20:18.027[api]              info: Selected engine block: no builder block produced fork=deneb, builderSelection=maxprofit, slot=5, isBuilderEnabled=false, isEngineEnabled=true, strictFeeRecipientCheck=undefined, builderBoostFactor=100, durationMs=11, executionPayloadValue=0.00000 ETH, consensusBlockValue=0.00087 ETH, blockTotalValue=0.00087 ETH
Apr-05 14:20:24.004[]                 info: Synced - slot: 5 - head: (slot -1) 0x9cc3…383b - exec-block: valid(2 0xf3b5…) - finalized: 0x0000…0000:0 - peers: 5
Apr-05 14:20:26.014[rest]            error: Req req-5p getAggregatedAttestation error - No attestation for slot=5 dataRoot=0xd947df35713911cac72a977556a795d217eefc7a66bfded54347e18469c675dd
Error: No attestation for slot=5 dataRoot=0xd947df35713911cac72a977556a795d217eefc7a66bfded54347e18469c675dd
    at AttestationPool.getAggregate (file:///usr/app/packages/beacon-node/src/chain/opPools/attestationPool.ts:136:13)
    at Object.getAggregatedAttestation (file:///usr/app/packages/beacon-node/src/api/impl/validator/index.ts:1070:47)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at Object.handler (file:///usr/app/packages/api/src/utils/server/genericJsonServer.ts:45:23)

Lodestar seems to be able to submit sync committee signatures, but not block proposals

Snooper between cl <> vc reports:

Nothing out of the ordinary

Expected behavior

I would expect all client combinations would work.

Steps to reproduce

config.yaml:

participants:
  - cl_type: lodestar
    cl_image: chainsafe/lodestar:next
    vc_type: nimbus
additional_services:
  - dora
snooper_enabled: true
global_log_level: debug

kurtosis run github.com/kurtosis-tech/ethereum-package --args-file config.yaml

Additional context

Current BN <> VC Compatibility list tracker

Screenshot 2024-04-05 at 14 08 53

Operating system

Linux

Lodestar version or commit hash

Version: v1.17.0/898cd90

cc: @pk910

barnabasbusa commented 8 months ago

Nimbus reports doppelganger detection:

NTC 2024-04-05 14:40:50.011+00:00 Doppelganger detection active - skipping validator duties while observing the network topics="val_pool" validator=867e8956 slot=107 doppelCheck=none() activationEpoch=0
INF 2024-04-05 14:40:54.000+00:00 Slot start                                 slot=108 epoch=3 attestationIn=4s blockIn=1m validators=64 node_status=synced delay=171us45ns
tersec commented 8 months ago

Is this just the doppelganger? That's intentional. Does --doppelganger-detection=off work?

tersec commented 8 months ago

Nimbus doesn't use doppelganger protection on sync committees because they're not slashable.

nflaig commented 8 months ago

@tersec regarding the getAggregate error, also reported in https://github.com/ChainSafe/lodestar/issues/6631, is Nimbus by any chance requesting an aggregate attestation even if no validator is aggregator? We had the same issue with Lighthouse previously and it was fixed on their end https://github.com/sigp/lighthouse/issues/4712

See https://github.com/ChainSafe/lodestar/issues/6634#issuecomment-2042672443

barnabasbusa commented 8 months ago

even with --doppelganger-detection=off I'm getting:

ERR 2024-04-05 18:03:31.097+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV3(best);unexpected-data" node=http://172.16.0.31:8562[Lodestar/v1.17.0/f2ec0d4] node_index=0 node_roles=AGBSDT
WRN 2024-04-05 18:03:31.097+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV3(best);unexpected-data" wall_slot=28 validator_index=78 service=block_service slot=28 validator=8890da28@78
nflaig commented 8 months ago

even with --doppelganger-detection=off I'm getting:

ERR 2024-04-05 18:03:31.097+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV3(best);unexpected-data" node=http://172.16.0.31:8562[Lodestar/v1.17.0/f2ec0d4] node_index=0 node_roles=AGBSDT
WRN 2024-04-05 18:03:31.097+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV3(best);unexpected-data" wall_slot=28 validator_index=78 service=block_service slot=28 validator=8890da28@78

The "unexpected data" might be because we are returning execution_payload_source metadata field in the response as we allow to produce a blinded local block but this is field is not officially part of the block v3 api and was only proposed in https://github.com/ethereum/beacon-APIs/issues/387 but not included.

See https://github.com/ChainSafe/lodestar/issues/6634#issuecomment-2042600225

nflaig commented 7 months ago

The "unexpected data" might be because we are returning execution_payload_source metadata field in the response

After reviewing how Nimbus handles this case it does not seem to be the issue as it only uses metadata provided in headers which Lodestar is setting and only parses the response.data

beacon_chain/validator_client/api.nim#L2145

res = decodeBytes(ProduceBlockResponseV3, response.data, response.contentType, version, blinded, executionValue, consensusValue)

And even there they seem to allow unknown fields

beacon_chain/spec/eth2_apis/eth2_rest_serialization.nim#L3950-L3952

ok(RestJson.decode(value, T, requireAllFields = true, allowUnknownFields = true))

Based on just reviewing Lodestar and Nimbus code it's hard to tell what's the actual issue, would need to inspect the data we are sending the further investigate this

nflaig commented 7 months ago

@tersec regarding the getAggregate error, also reported in #6631, is Nimbus by any chance requesting an aggregate attestation even if no validator is aggregator? We had the same issue with Lighthouse previously and it was fixed on their end sigp/lighthouse#4712

This is not the case, Nimbus will only request an aggregated attestation if at least one validator is aggregator

beacon_chain/validator_client/attestation_service.nim#L280

if len(aggregateItems) > 0:

But I think I know what the issue here is, Nimbus uses different strategies per API, and if I interpreted them correctly this is why we are seeing the error:

The submitPoolAttestations is only sent to the first / primary beacon node (unless an error is encountered)

beacon_chain/validator_client/attestation_service.nim#L66

await vc.submitPoolAttestations(@[attestation], ApiStrategyKind.First)

While the getAggregatedAttestation is sent to all connected beacon nodes and the "best" response is picked

beacon_chain/validator_client/attestation_service.nim#L283-L284

await vc.getAggregatedAttestation(slot, attestationRoot, ApiStrategyKind.Best)

But Lodestar might not have an aggregated attestation in it's cache if the validator client did not previously submit an attestation for attestation_data_root. Although this error might be not happen consistently as the beacon node might prepare the aggregate due to receiving the attestation via gossip (gossipHandlers.ts#L490-L492).

We have just improved the error handling if there is no aggregated attestation available https://github.com/ChainSafe/lodestar/pull/6648 to follow spec and produce a less nosiy error as this is somewhat expected to happen in a setup with multiple nodes.

If my analysis is correct, this is also not an issue as Nimbus would still produce the aggregate attestation if it received the data from the primary node.

@barnabasbusa this should not happen in a 1:1 setup without fallback node, I am assuming in your tests you had multiple bns connected to a single vc

barnabasbusa commented 7 months ago

@nflaig We have 1:1 setups only. We haven't considered any testing for 1:x testing.

nflaig commented 7 months ago

@nflaig We have 1:1 setups only. We haven't considered any testing for 1:x testing.

Then I have no idea why this happens but something to note is that there is also a very strange error which should never happen

Apr-05 14:20:00.472[rest]            error: Req req-r produceBlockV2 error - REGEN_ERROR_SLOT_BEFORE_BLOCK_SLOT
Error: REGEN_ERROR_SLOT_BEFORE_BLOCK_SLOT

So maybe there was just a clock issue / time skew between the validator client and beacon node?

barnabasbusa commented 7 months ago

All tests are ran on the same (docker backend) machine, and everyone should be able to replicate it with the config I have posted above. Doubt that there would be any sort of clock issues between two containers hosted on the same physical node.

nflaig commented 7 months ago

and everyone should be able to replicate it with the config I have posted above.

Will have to do this next, thanks for the detailed infos

nflaig commented 7 months ago

Apr-05 14:20:26.014[rest] error: Req req-5p getAggregatedAttestation error - No attestation for slot=5 dataRoot=0xd947df35713911cac72a977556a795d217eefc7a66bfded54347e18469c675dd Error: No attestation for slot=5 dataRoot=0xd947df35713911cac72a977556a795d217eefc7a66bfded54347e18469c675dd

The aggregated attestation issue has been fixed by https://github.com/ChainSafe/lodestar/pull/6668. There still seems to be strange behavior by Nimbus VC that it requests an aggregated attestation for the first 1-5 slots even though it does not submit subnet subscriptions with is_aggregator=true for those slots beforehand. But other than that looks good, attestations, aggregates, and sync committee works as expected.

The only remaining issue is block production, my best guess right now is that Nimbus does not like that Lodestar returns a JSON payload from produceBlockV3 API.

nflaig commented 6 months ago

The only remaining issue is block production, my best guess right now is that Nimbus does not like that Lodestar returns a JSON payload from produceBlockV3 API.

Confirmed this issue is related us return a JSON payload in response

ERR 2024-05-29 13:37:55.640+00:00 Beacon node provides unexpected response   reason="Serialization error;200;produceBlockV3(best);unexpected-data" node=http://172.16.0.15:4000[Lodestar/v1.18.1/8b6ecc4] node_index=0 node_roles=AGBSDT
WRN 2024-05-29 13:37:55.645+00:00 Unable to retrieve block data              reason="Serialization error;200;produceBlockV3(best);unexpected-data" wall_slot=12 validator_index=53 service=block_service slot=12 validator=b09cb155@53

Block production with Nimbus VC will be fixed once we merge https://github.com/ChainSafe/lodestar/pull/6749 but there is another issue publishing blocks as SSZ (while JSON seems work fine there). Does not seem to be a problem with other clients and there is an issue on the Nimbus side to track this https://github.com/status-im/nimbus-eth2/issues/6205 <-- this is an issue on Nimbus BN