aptos-labs / aptos-core

Aptos is a layer 1 blockchain built to support the widespread use of blockchain through better technology and user experience.
https://aptosfoundation.org

[Bug] [v1.13.1] Aptos Fullnode's synchronization consistently fails. #13609

Open jinseon2416 opened 1 month ago

jinseon2416 commented 1 month ago

πŸ› Bug

We continuously collect block information through the node's RPC API while synchronizing an archive fullnode. After recently upgrading to the latest Aptos version, v1.13.1, synchronization stops frequently. After downgrading to v1.12.1, the symptom does not occur. Also, if we simply synchronize without querying the RPC API, the symptom does not appear.

When synchronization stalls, only the following logs are output and no synchronization-related logs appear. If we restart the aptos process, syncing proceeds normally again for a while and then stops again.
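
For context, the block queries we run continuously against the node look roughly like this (an illustrative sketch only; the block height is arbitrary and we assume the default REST API port 8080):

```bash
# Illustrative example of the block queries issued against the local
# fullnode REST API while it is syncing (height and port are assumptions).
curl -s "http://127.0.0.1:8080/v1/blocks/by_height/970225494?with_transactions=true"
```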

Stack trace/error message

Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:10.038956Z [data-client-0] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:10.038975Z [data-client-0] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:15.038523Z [data-client-0] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:15.038541Z [data-client-0] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:19.078768Z [data-client-0] INFO state-sync/aptos-data-client/src/poller.rs:444 {"event":"aggregate_summary","message":"Global data summary: AdvertisedData { Highest epoch ending ledger info, epoch: Some(7285). Highest synced ledger info, epoch: Some(7286), version: Some(970225494). Lowest transaction version: Some(820220494), Lowest transaction output version: Some(820220494), Lowest states version: Some(890225440) }, OptimalChunkSizes { epoch_chunk_size: 200, state_chunk_size: 4000, transaction_chunk_size: 2000, transaction_output_chunk_size: 1000 }","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:20.039162Z [data-client-30] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:20.039180Z [data-client-30] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:25.038909Z [data-client-31] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:25.038928Z [data-client-31] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:25.791820Z [peer-mon-27] WARN peer-monitoring-service/client/src/peer_states/network_info.rs:145 {"event":"invalid_response","message":"Peer returned invalid depth from validators: 1","name":"network_info_request","peer":"Public:568fdb6a"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:29.073928Z [data-client-0] INFO state-sync/aptos-data-client/src/poller.rs:444 {"event":"aggregate_summary","message":"Global data summary: AdvertisedData { Highest epoch ending ledger info, epoch: Some(7285). Highest synced ledger info, epoch: Some(7286), version: Some(970225760). Lowest transaction version: Some(820221010), Lowest transaction output version: Some(820221010), Lowest states version: Some(890225707) }, OptimalChunkSizes { epoch_chunk_size: 200, state_chunk_size: 4000, transaction_chunk_size: 2000, transaction_output_chunk_size: 1000 }","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:30.038314Z [data-client-0] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:30.038334Z [data-client-0] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:35.038927Z [data-client-0] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:35.038945Z [data-client-0] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:36.573597Z [telemetry-27] WARN /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/reqwest-retry-0.2.3/src/middleware.rs {"message":"Retry attempt #0. Sleeping 264.604842ms before the next attempt"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:37.037511Z [telemetry-0] WARN /home/ubuntu/.cargo/registry/src/index.crates.io-6f17d22bba15001f/reqwest-retry-0.2.3/src/middleware.rs {"message":"Retry attempt #1. Sleeping 8.210725312s before the next attempt"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:39.311962Z [data-client-31] INFO state-sync/aptos-data-client/src/poller.rs:444 {"event":"aggregate_summary","message":"Global data summary: AdvertisedData { Highest epoch ending ledger info, epoch: Some(7285). Highest synced ledger info, epoch: Some(7286), version: Some(970226065). Lowest transaction version: Some(820221010), Lowest transaction output version: Some(820221010), Lowest states version: Some(890225967) }, OptimalChunkSizes { epoch_chunk_size: 200, state_chunk_size: 4000, transaction_chunk_size: 2000, transaction_output_chunk_size: 1000 }","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:40.039141Z [data-client-30] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
Jun 07 19:41:45 dp-aptos-mainnet aptos-fullnode[144386]: 2024-06-07T19:41:40.039159Z [data-client-30] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}

Expected Behavior

Synchronization should not stall. We need to determine which difference between versions 1.12.1 and 1.13.1 is causing the synchronization stoppage.

System information

Please complete the following information:

Additional context

I encountered an issue after upgrading to a new mainnet release. The same symptom does not appear in version 1.12.1.

p1xel32 commented 3 weeks ago

May I ask, @jinseon2416: was the node bootstrapped from a snapshot?

JoshLind commented 3 weeks ago

Hmm... this is unusual. We haven't seen any other reports of this.

Can you please: (i) stop the fullnode; (ii) restart the fullnode and collect the logs (e.g., write them to a file); (iii) let the node run for a few minutes (e.g., 5 minutes?); and (iv) upload the entire log file of the node. This will hopefully show us the error the node is running into.

Note: it's necessary to have the entire log file from reboot so that we can help debug 😄
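
For example, if you run the node under systemd (as the journald-style prefixes in your logs suggest), something like the following sketch should capture a complete trace; the unit name "aptos-fullnode" is an assumption, so adjust it to your deployment:

```bash
# Restart the node, let it run for ~5 minutes, then dump the full log to a file.
sudo systemctl restart aptos-fullnode
sleep 300
journalctl -u aptos-fullnode --since "10 minutes ago" > aptos-fullnode.log
```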

m229abd commented 3 weeks ago

Hi there. I'm here to report another instance of this. I have been in contact with @p1xel32 on Discord. I tried fast sync, then went on to try bootstrapping. I contacted the Polkachu team and they can't update the snapshots for bootstrapping either. The BwareLabs snapshot is also lagging behind schedule. I've attached the full log since initially pulling the Docker image: log.txt

p1xel32 commented 3 weeks ago

@m229abd We have identified several users experiencing similar issues. For those who re-synced using the default configuration, the problem was resolved. If you're encountering connectivity issues, especially after bootstrapping your nodes, re-syncing with the default configuration might be beneficial.

You can find the default config here: https://github.com/aptos-labs/aptos-core/blob/main/config/src/config/test_data/public_full_node.yaml

@JoshLind I believe the original issue is here: https://github.com/aptos-labs/aptos-core/issues/13553

jinseon2416 commented 3 weeks ago

We performed bootstrapping using the Aptos Foundation backup storage snapshots: https://aptos.dev/nodes/full-node/aptos-db-restore

What we need is an archival mainnet node. With the default configuration, the full transaction history is not stored and executed, so we use the following state sync settings (https://aptos.dev/guides/state-sync):

state_sync:
  state_sync_driver:
    bootstrapping_mode: ExecuteTransactionsFromGenesis
    continuous_syncing_mode: ExecuteTransactions

Are there any other alternatives?

jinseon2416 commented 2 weeks ago

We are sharing the logs produced when data synchronization stops. The stall seems to occur when the DataStreamNotificationTimeout ("5s") is exceeded. Are there any recommendations?

2024-06-21T02:51:09.284649Z [sync-driver-1] WARN state-sync/state-sync-driver/src/continuous_syncer.rs:195 Resetting the currently active data stream due to too many timeouts!

2024-06-21T02:51:09.284684Z [sync-driver-1] INFO state-sync/state-sync-driver/src/utils.rs:247 {"message":"Terminating the data stream with ID: 1, notification and feedback: None","name":"driver"}

2024-06-21T02:51:09.284718Z [sync-driver-1] WARN state-sync/state-sync-driver/src/driver.rs:631 {"error":{"CriticalDataStreamTimeout":"12"},"message":"Error found when driving progress of the continuous syncer!","name":"driver"}

2024-06-21T02:51:09.284782Z [stream-serv-9] INFO state-sync/data-streaming-service/src/streaming_service.rs:222 {"event":"success","message":"Terminating the data stream with ID: 1. Notification and feedback: None","name":"handle_terminate_request","stream_id":1}

2024-06-21T02:51:09.291146Z [stream-serv-9] INFO state-sync/data-streaming-service/src/streaming_service.rs:297 {"event":"success","message":"Stream created for request: StreamRequestMessage { stream_request: ContinuouslyStreamTransactionsOrOutputs(ContinuouslyStreamTransactionsOrOutputsRequest { known_version: 992582227, known_epoch: 7451, include_events: false, target: None }), response_sender: Sender { complete: false } }","name":"handle_stream_request","stream_id":2}

2024-06-21T02:51:09.336301Z [stream-serv-1] INFO state-sync/data-streaming-service/src/streaming_service.rs:371 {"event":"success","message":"Data stream initialized.","name":"initialize_stream","stream_id":2}

2024-06-21T02:51:14.036718Z [data-client-9] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}

2024-06-21T02:51:14.036834Z [data-client-9] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}

2024-06-21T02:51:14.137172Z [data-client-9] INFO state-sync/aptos-data-client/src/poller.rs:444 {"event":"aggregate_summary","message":"Global data summary: AdvertisedData { Highest epoch ending ledger info, epoch: Some(7450). Highest synced ledger info, epoch: Some(7451), version: Some(992584475). Lowest transaction version: Some(842581461), Lowest transaction output version: Some(842581461), Lowest states version: Some(912584460) }, OptimalChunkSizes { epoch_chunk_size: 200, state_chunk_size: 4000, transaction_chunk_size: 2000, transaction_output_chunk_size: 1000 }","name":"peer_states"}

2024-06-21T02:51:14.292451Z [sync-driver-1] WARN state-sync/state-sync-driver/src/driver.rs:631 {"error":{"DataStreamNotificationTimeout":"5s"},"message":"Error found when driving progress of the continuous syncer!","name":"driver"}

2024-06-21T02:51:19.034046Z [data-client-14] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}

2024-06-21T02:51:19.034076Z [data-client-14] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}

2024-06-21T02:51:19.293671Z [sync-driver-1] WARN state-sync/state-sync-driver/src/driver.rs:631 {"error":{"DataStreamNotificationTimeout":"5s"},"message":"Error found when driving progress of the continuous syncer!","name":"driver"}
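
For reference, the two values in these errors (the 5s notification timeout and the limit of 12 consecutive timeouts) appear to map to state sync driver settings that can be overridden in the node config. A sketch of raising them, assuming the parameter names are unchanged in this release:

```yaml
# Assumed parameter names; the values shown simply double the apparent defaults.
state_sync:
  state_sync_driver:
    max_stream_wait_time_ms: 10000   # "DataStreamNotificationTimeout": "5s" suggests a 5000 ms default
    max_num_stream_timeouts: 24      # "CriticalDataStreamTimeout": "12" suggests a default of 12
```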

jinseon2416 commented 1 week ago

The same synchronization interruption is occurring in the latest version v1.15.2.

2024-06-28T04:31:30.602374Z [sync-driver-14] WARN state-sync/state-sync-driver/src/bootstrapper.rs:593 Resetting the currently active data stream due to too many timeouts!
2024-06-28T04:31:30.602398Z [sync-driver-14] INFO state-sync/state-sync-driver/src/utils.rs:247 {"message":"Terminating the data stream with ID: 24, notification and feedback: None","name":"driver"}
2024-06-28T04:31:30.602440Z [sync-driver-14] WARN state-sync/state-sync-driver/src/driver.rs:657 {"error":{"CriticalDataStreamTimeout":"12"},"message":"Error found when checking the bootstrapper progress!","name":"driver"}
2024-06-28T04:34:30.649098Z [stream-serv-5] INFO state-sync/data-streaming-service/src/streaming_service.rs:297 {"event":"success","message":"Stream created for request: StreamRequestMessage { stream_request: GetAllTransactionsOrOutputs(GetAllTransactionsOrOutputsRequest { start_version: 1001178912, end_version: 1001288966, proof_version: 1003283292, include_events: false }), response_sender: Sender { complete: false } }","name":"handle_stream_request","stream_id":28}
2024-06-28T04:34:30.655387Z [stream-serv-0] INFO state-sync/data-streaming-service/src/streaming_service.rs:371 {"event":"success","message":"Data stream initialized.","name":"initialize_stream","stream_id":28}
2024-06-28T04:34:32.103636Z [network-Pub-0] INFO network/framework/src/peer_manager/mod.rs:223 Current connected peers {"network_context":{"role":"full_node","network_id":"Public","peer_id":"6f7a88ac44b154da932976a5358dddb478a19066b879f8ca322d45d7ddadc46c"},"peers":"[(371dfdee9fbecaa358683936c5946aa28393067fa60e1e185c4250da461e15cf, /dns/vfn.aptos.dsrvlabs.net/tcp/6182/noise-ik/0xdf0eb85221ed6f116dd5caa1e7461d2ec2e3c6be4b252ac90246b3384e46d96d/handshake/0, outbound), (c41e4eded759ab4a31e999bbef6d14361fd94bea59db8310703b1668af98a23d, /dns/aptos-vfn-en.marblex.io/tcp/6182/noise-ik/0xdac7e141cc34fe3f06f40f54e3b932eeacd5b81631e18c0775a84f03959d953c/handshake/0, outbound), (2fa9d01fb4a1c2fdac5f45419b98213604bc215817d4b394ab372e5d59d2753c, /dns/aptos-staked.prod-eks-ap-northeast-2.staked.cloud/tcp/6182/noise-ik/0xbcfd67cf425a8633620fd848c9446a69d34c601d2d73bb864c86584f53012f66/handshake/0, outbound), (0a6eb54c274da96778a6702740a763f6aa7b02eef3d1efc2623d28289d185118, /ip4/47.91.19.61/tcp/6182/noise-ik/0x7d9348a4e93a17dbb73b511807710415ebe2a6c2c36ab404e2d63174407e625b/handshake/0, outbound), (568fdb6acf26aae2a84419108ff13baa3ebf133844ef18e23a9f47b5af16b698, /dns/node1.cloud-b.mainnet.aptoslabs.com/tcp/6182/noise-ik/0x003cc2ed36e7d486539ac2c411b48d962f1ef17d884c3a7109cad43f16bd5008/handshake/0, outbound), (947c6451ad9a8659cba4e60a6da41086eb66e73fc07936bd3b7044873b5e094d, /dns/vfn.mainnet.aptos.despread.io/tcp/6182/noise-ik/0xd4eabfa79a26a59f5701397277556c4f1343e4892bfecf13ef9401c1650f317a/handshake/0, outbound)]"}
2024-06-28T04:34:33.107022Z [data-client-12] INFO state-sync/aptos-data-client/src/poller.rs:444 {"event":"aggregate_summary","message":"Global data summary: AdvertisedData { Highest epoch ending ledger info, epoch: Some(7535). Highest synced ledger info, epoch: Some(7536), version: Some(1003302786). Lowest transaction version: Some(853298253), Lowest transaction output version: Some(853298253), Lowest states version: Some(923302743) }, OptimalChunkSizes { epoch_chunk_size: 200, state_chunk_size: 4000, transaction_chunk_size: 2000, transaction_output_chunk_size: 1000 }","name":"peer_states"}
2024-06-28T04:34:34.003254Z [data-client-12] INFO state-sync/aptos-data-client/src/latency_monitor.rs:239 {"event":"waiting_for_catchup","message":"Waiting for the node to catch up to the latest version before starting the latency monitor.","name":"latency_monitor"}
2024-06-28T04:34:35.003223Z [data-client-12] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
2024-06-28T04:34:35.003247Z [data-client-12] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}
2024-06-28T04:34:35.650437Z [sync-driver-0] WARN state-sync/state-sync-driver/src/driver.rs:657 {"error":{"DataStreamNotificationTimeout":"5s"},"message":"Error found when checking the bootstrapper progress!","name":"driver"}
2024-06-28T04:34:40.003748Z [data-client-12] INFO state-sync/aptos-data-client/src/client.rs:1190 {"event":"priority_and_regular_peers","message":"Number of priority peers: 6. Number of regular peers: 0","name":"peer_states"}
2024-06-28T04:34:40.003774Z [data-client-12] INFO state-sync/aptos-data-client/src/client.rs:1223 {"event":"priority_peer_categories","message":"Number of connected peers by priority: {HighPriority: 6}","name":"peer_states"}
2024-06-28T04:34:40.651480Z [sync-driver-0] WARN state-sync/state-sync-driver/src/driver.rs:657 {"error":{"DataStreamNotificationTimeout":"5s"},"message":"Error found when checking the bootstrapper progress!","name":"driver"}

JoshLind commented 5 days ago

@jinseon2416

  1. I took a look through all the changes between v1.12 and v1.13, and there's nothing that should affect state sync in the way you are seeing. The same is true for changes between v1.12 and v1.15. Likewise, this is still the only report of this type of issue.
  2. Are the logs provided by @m229abd (above), the same logs as your node? Or are those logs unrelated? Sorry, hard to tell who's within the same org :)
  3. If you need an archival mainnet node (and you're trying to sync from genesis), you should use intelligent syncing (see the config sketch after this list): https://aptos.dev/en/network/nodes/configure/state-sync#intelligent-syncing. Execution-only syncing can be slow.
  4. Looking at the logs you have provided above, it seems to me that your node is experiencing a large number of timeouts. My first recommendation is to inspect your network, e.g., bandwidth, latency, load balancers, firewalls, HAProxy, etc. You really shouldn't be timing out this often when fetching data.
  5. What does your node config look like? Are you hard-coding seed peers? If so, are those peers unreliable/flaky?
  6. The logs above are still insufficient to really diagnose the problem. I'd recommend following the steps I mentioned above to produce a full log trace. That is, assuming the logs provided by @m229abd are unrelated.
  7. Can you elaborate on what you mean by "if you simply synchronize without querying the RPC API, the same symptom does not appear"?
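
For point 3, the intelligent syncing configuration described in the linked docs looks like this (a sketch; confirm the exact values against the docs for your release):

```yaml
state_sync:
  state_sync_driver:
    bootstrapping_mode: ExecuteOrApplyFromGenesis
    continuous_syncing_mode: ExecuteTransactionsOrApplyOutputs
```
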
JoshLind commented 5 days ago

@p1xel32

  1. Thanks for pointing to your previous issue (https://github.com/aptos-labs/aptos-core/issues/13553). I'll update the node docs to point out that community snapshots don't provide indexer support. Because they are community snapshots, they are out of our control.
  2. But my feeling is that the issue faced by @jinseon2416 is unrelated to this. If they had bootstrapped from a community snapshot without indexer support, I'd expect the node not to sync at all, as opposed to running for a while and then timing out.
JoshLind commented 5 days ago

@m229abd,

  1. Looking at the logs, I see these errors:

    Errors: [InvalidResponse(\"Unexpected error encountered: invalid value: integer `5`, expected variant index 0 <= i < 4\"), InvalidResponse(\"Unexpected error encountered: invalid value: integer `5`, expected variant index 0 <= i < 4\")],

    This tells me that your node is running a version incompatible with the peers on the rest of the network. When it makes calls to receive new data (after fast syncing), it cannot parse the responses because of the message version incompatibility (a small illustration follows at the end of this comment). Can you please update to the latest mainnet version (currently v1.15.2) and retry?

  2. I suspect this is completely unrelated to the issue faced by @jinseon2416.
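
To illustrate point 1, here is a minimal, self-contained sketch of the failure mode. These are not the actual Aptos message types, just an illustration of how an older enum definition fails to decode a newer peer's variant over BCS:

```rust
// Sketch only; assumes serde (with the derive feature) and the bcs crate as dependencies.
use serde::{Deserialize, Serialize};

// What a newer peer might send: an enum that has grown extra variants.
#[derive(Serialize)]
enum NewMessage { A, B, C, D, E, F }

// What an older node still expects: only the first four variants.
#[derive(Deserialize, Debug)]
enum OldMessage { A, B, C, D }

fn main() {
    // The newer peer serializes its sixth variant (BCS variant index 5)...
    let bytes = bcs::to_bytes(&NewMessage::F).unwrap();

    // ...and the older node cannot decode it, yielding an error like:
    // "invalid value: integer `5`, expected variant index 0 <= i < 4"
    let err = bcs::from_bytes::<OldMessage>(&bytes).unwrap_err();
    println!("{err}");
}
```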

m229abd commented 5 days ago

Sorry for not keeping you informed: we did update shortly after v1.15.2 came out, and it fixed the synchronization. Thank you for following up on the issue.
