ymonye opened 2 months ago
A few pointers that would help:
- You're using an old set of state-sync peers; you should update to our new list here: https://docs.sui.io/guides/operator/sui-full-node#set-up-from-source
- It looks like your node may have fallen out of sync because you don't have the archival fallback configured. I'd recommend setting that up: https://docs.sui.io/guides/operator/archives#set-up-archival-fallback (see the config sketch after this list)
- One way to ensure you don't have to re-sync the whole node in the event of data loss is to push snapshots of the db to an object store: https://docs.sui.io/guides/operator/snapshots#enabling-db-snapshots
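For reference, a minimal sketch of that archival fallback, assuming the state-archive-read-config layout from the linked docs; the region and credentials are placeholders to verify against the current docs:

# fullnode.yaml — archival fallback (sketch; verify keys against the linked docs)
state-archive-read-config:
  - object-store-config:
      object-store: "S3"
      bucket: "mysten-mainnet-archives"                # public Mysten archive bucket
      aws-access-key-id: "<AWS_ACCESS_KEY_ID>"         # placeholder
      aws-secret-access-key: "<AWS_SECRET_ACCESS_KEY>" # placeholder
      aws-region: "<aws-region>"                       # placeholder
      object-store-connection-limit: 20
    concurrency: 5
    use-for-pruning-watermark: false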
Thanks. Would step 2, configuring the archival fallback, incur any fees on the AWS side? I see that it requires an AWS access key ID & secret access key. I'm assuming step 3 most certainly does, but please correct me if I'm wrong.
I've updated my list of state-sync peers, set up an archival fallback with an AWS account, and ensured it has the correct policies, but it seems my 2nd node's sync cannot progress due to this warning:
fullnode-02-fullnode-1 | 2024-08-16T12:35:18.846945Z WARN handle_execution_effects{seq=46528740 epoch=487}: sui_core::checkpoints::checkpoint_executor: Transaction effects for checkpoint tx digests [TransactionDigest(BWVK834A5L8455uSWXhMfr4BtHAvqCmjr5fzN8JV6Qfn), TransactionDigest(76udeMZQq6A1Mw1ZMed8f9vu55uYRf6UEkDA2Gy1XczH), TransactionDigest(EdMd39mbFGo5Mj9B6FFW1WV4Y4qLN9KtJTNiZmNYGiZZ), TransactionDigest(2uz38pA1tHu3nhGPLk3c9YmGKX5Vi7QAmAUtdZe9dBi1), TransactionDigest(3BXn6HKGXymfdzR5DAwDd1oaBg3TpBzicGa9fYLo2nTw), TransactionDigest(4FkVSfRRN7dxrhwsbu9gniYCQ5KiyJQ5YdNXL9Zgcxc3), TransactionDigest(AJ6iKBrfYoKTPtSd2U7w3vDUuLoaRqpWnHQq6XQzKMxZ), TransactionDigest(BU61JziErtyrhw1uWM43gQHH9qdezZKk1KWmoZucFYT6), TransactionDigest(CaScdwj8YAmHTH1jcsMNvRyFtVikQKrH4JLhHtRhw3Ue), TransactionDigest(E6ss9oNBwapfGicehNJGNaXtAbf8WBdTQkcNtbZ2XAcM), TransactionDigest(HzyP298Ru1gJWZGUCkxVQF4fqQkiiivzQa9CGtnc59tq), TransactionDigest(msPsLNFPBDs3uZHAUgHGGHNsadTipwQ3LH8hXkPfJcR), TransactionDigest(CdBTjzxguG3M1MiSiLjPPGwEHcfXJBwnY38FYzDdmUCX), TransactionDigest(DUoiNN7ZZsYhrCLGtppC6eV2FFqBKkuqA7kSRwcg9A6C), TransactionDigest(H59VaeC8FazAj3CBgyP8fF83ucGtQB35hTjHN3ohUZiX)] not present within 840s.
fullnode-02-fullnode-1 | 2024-08-16T12:35:18.846990Z WARN handle_execution_effects{seq=46528740 epoch=487}: sui_core::checkpoints::checkpoint_executor: Transaction TransactionDigest(BWVK834A5L8455uSWXhMfr4BtHAvqCmjr5fzN8JV6Qfn) has missing input objects [VersionedObject { id: 0x0000000000000000000000000000000000000000000000000000000000000006, version: SequenceNumber(61149607) }]
My primary node still cannot get past the error:
fullnode-01-fullnode-1 | thread 'sui-node-runtime' panicked at crates/sui-core/src/checkpoints/checkpoint_executor/mod.rs:1223:2024-08-16T12:14:06.714753Z ERROR telemetry_subscribers: panicked at crates/sui-core/src/checkpoints/checkpoint_executor/mod.rs:1223:21:
fullnode-01-fullnode-1 | Transaction effects for effects digest TransactionEffectsDigest(F7oPMEsvEiMbVi97HAwz6LsQHnHhSFYrYkNHceQ6emCC) do not exist in effects table panic.file="crates/sui-core/src/checkpoints/checkpoint_executor/mod.rs" panic.line=1223 panic.column=21
fullnode-01-fullnode-1 | 21:
fullnode-01-fullnode-1 | Transaction effects for effects digest TransactionEffectsDigest(F7oPMEsvEiMbVi97HAwz6LsQHnHhSFYrYkNHceQ6emCC) do not exist in effects table
fullnode-01-fullnode-1 | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Is this one cooked?
Also I don't quite understand the following from the docs: https://docs.sui.io/guides/operator/archives
Sui Archival nodes (Full nodes that write to an archive) don't store historical state on local storage and don't help query historical data. They serve the purpose of enabling peer nodes to catch up to the latest checkpoint and are useful for auditing and verifying the complete history of all transactions on the network.
So how exactly would I run a Sui node that serves historical data (all transactions / effects / events / etc.) and keeps all archival data on my local machine? I was under the impression that's what I was doing, especially with my pruning config:
num-latest-epoch-dbs-to-retain: 500000
num-epochs-to-retain: 500000
num-epochs-to-retain-for-checkpoints: 500000
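(For context: in fullnode.yaml these keys normally sit under the authority-store-pruning-config section, and retention values this large effectively disable pruning. A sketch assuming the standard layout:)

# fullnode.yaml — pruning section (sketch)
authority-store-pruning-config:
  num-latest-epoch-dbs-to-retain: 500000
  num-epochs-to-retain: 500000
  num-epochs-to-retain-for-checkpoints: 500000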
cc @phoenix-o for the panic
Also I don't quite understand the following from the docs:
What those docs are describing is a fullnode whose singular purpose is writing Sui archival data to a GCS/S3 bucket. This would be for the case where you don't want to rely on s3://mysten-mainnet-archives and instead want to write Sui archival data to your own bucket.
You're correct that those pruning configs will retain a large amount of historical data in the fullnode itself.
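A rough sketch of what such an archive-writing fullnode's config might look like, assuming the state-archive-write-config layout from the archives doc; the bucket name, region, and credentials are placeholders:

# fullnode.yaml — write archival data to your own bucket (sketch)
state-archive-write-config:
  object-store-config:
    object-store: "S3"
    bucket: "<your-archive-bucket>"                  # placeholder: your own bucket
    aws-access-key-id: "<AWS_ACCESS_KEY_ID>"         # placeholder
    aws-secret-access-key: "<AWS_SECRET_ACCESS_KEY>" # placeholder
    aws-region: "<aws-region>"                       # placeholder
    object-store-connection-limit: 20
  concurrency: 5
  use-for-pruning-watermark: false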
Thanks John. I figured I'd go ahead and attempt a new sync for a Sui mainnet (archive) node while I await the next steps for recovering my 2 other nodes.
Is the best practice to run a new sync (with full archive transactions / effects / events / states) with the above values or fullnode-template.yml, as well as the recommended archival fallback? At the moment it's not economical for us to push snapshots to Amazon S3.
My initial two Sui nodes were synced since mainnet launch, and I recall some people in the Discord stated they were unable to sync today from checkpoint 0. Being able to sync from the very beginning with all historical transactions is critical for the service we provide, and as of now we have 2 inoperable Sui nodes.
My initial two Sui nodes were synced since mainnet launch, and I recall some people in the Discord stated they were unable to sync today from checkpoint 0.
If you properly set up the archival fallback and have a node with sufficient bandwidth, you should have no problem syncing from the beginning of history. It could take quite a long time though (weeks).
Is the best practice to run a new sync (with full archive transactions / effects / events / states) with the above values or fullnode-template.yml, as well as the recommended archival fallback?
Hard to say without understanding why you need all historical transactions. Another option is to create a custom indexer of your own that only stores the txn data you need in whatever format you prefer; this is a fairly new feature though. A sketch follows below.
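For reference, a minimal sketch of such a custom indexer, based on the sui-data-ingestion-core framework's Worker trait; the setup_single_workflow helper, the checkpoints.mainnet.sui.io bucket URL, and the TxnWorker type are assumptions to verify against the current custom-indexer docs:

// Sketch: stream checkpoints from the public bucket and keep only the txn data you need.
use anyhow::Result;
use async_trait::async_trait;
use sui_data_ingestion_core::{setup_single_workflow, Worker};
use sui_types::full_checkpoint_content::CheckpointData;

struct TxnWorker; // hypothetical worker that writes selected txn data to your own store

#[async_trait]
impl Worker for TxnWorker {
    async fn process_checkpoint(&self, checkpoint: CheckpointData) -> Result<()> {
        // Filter / transform transactions here and persist them in whatever format you prefer.
        for tx in &checkpoint.transactions {
            let _digest = tx.transaction.digest();
            // e.g. insert into a local database
        }
        Ok(())
    }
}

#[tokio::main]
async fn main() -> Result<()> {
    let (executor, _exit_sender) = setup_single_workflow(
        TxnWorker,
        "https://checkpoints.mainnet.sui.io".to_string(),
        0,    // initial checkpoint sequence number
        5,    // concurrency
        None, // extra reader options
    )
    .await?;
    executor.await?;
    Ok(())
}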
Thanks John. My company LedgerQL indexes historical transactions for multiple blockchains into a local database, and provides the ability for lookups & analytics as a service. This includes not only Sui but the various EVM / Cosmos SDK chains. It's imperative that all transactional data is available in our nodes, in the event we'd need to backfill erroneous data, which is a reason why we run archive nodes across the board.
My indexed Sui data is only 8 days delayed, which is when our nodes initially went out of sync. As of now it looks like we'd need to delete 7TB of Sui blockchain data, since the nodes entered a corrupt state after a power interruption, something that has never happened on any of the other blockchains we run. I'm hoping your team could provide a solution for our 2 nodes that are unable to resume syncing, as it wouldn't make sense for a blockchain node to enter a corrupt state after a power interruption.
My only concern is that if we had to delete the 7TB of data and start over from checkpoint 0, the new sync might not include all the historical data we'd need to backfill historical transactions / effects / events from a year ago. My prior 2 nodes had been syncing since the initial Sui mainnet launch with pruning "disabled," so I didn't have this concern at the time.
I'm hoping your team could provide a solution for our 2 nodes that are unable to resume syncing, as it wouldn't make sense for a blockchain node to enter a corrupt state after a power interruption.
Have you configured the archival fallback on the node that is not corrupted but having trouble syncing? I just synced a node from ~30 epochs ago to the latest tip with no issues. Other issues to check would be the system health and network on the machine where this fullnode is running. Also what are the specs of the host that is running this fullnode (RAM, CPU, disk, network bandwidth)?
Yes, I have the archival fallback set up for both nodes with an AWS account & correct permissions, per your earlier recommendation. This machine is pretty overkill for this node: 2TB DDR4 RAM, AMD 7773X (64 cores), dedicated 32TB Micron 9400 Pro NVMe SSD (1.5M read IOPS / 300k write IOPS), 10Gbps unmetered bandwidth.
The uncorrupted node (I had assumed both were corrupt) is unable to catch up during the sync process, with the below error repeating:
2024-08-22T00:58:16.643398Z INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=49881511
2024-08-22T00:58:16.819658Z INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=49881512
2024-08-22T00:58:16.977475Z WARN handle_execution_effects{seq=46528740 epoch=487}: sui_core::checkpoints::checkpoint_executor: Transaction effects for checkpoint tx digests [TransactionDigest(BWVK834A5L8455uSWXhMfr4BtHAvqCmjr5fzN8JV6Qfn), TransactionDigest(76udeMZQq6A1Mw1ZMed8f9vu55uYRf6UEkDA2Gy1XczH), TransactionDigest(EdMd39mbFGo5Mj9B6FFW1WV4Y4qLN9KtJTNiZmNYGiZZ), TransactionDigest(2uz38pA1tHu3nhGPLk3c9YmGKX5Vi7QAmAUtdZe9dBi1), TransactionDigest(3BXn6HKGXymfdzR5DAwDd1oaBg3TpBzicGa9fYLo2nTw), TransactionDigest(4FkVSfRRN7dxrhwsbu9gniYCQ5KiyJQ5YdNXL9Zgcxc3), TransactionDigest(AJ6iKBrfYoKTPtSd2U7w3vDUuLoaRqpWnHQq6XQzKMxZ), TransactionDigest(BU61JziErtyrhw1uWM43gQHH9qdezZKk1KWmoZucFYT6), TransactionDigest(CaScdwj8YAmHTH1jcsMNvRyFtVikQKrH4JLhHtRhw3Ue), TransactionDigest(E6ss9oNBwapfGicehNJGNaXtAbf8WBdTQkcNtbZ2XAcM), TransactionDigest(HzyP298Ru1gJWZGUCkxVQF4fqQkiiivzQa9CGtnc59tq), TransactionDigest(msPsLNFPBDs3uZHAUgHGGHNsadTipwQ3LH8hXkPfJcR), TransactionDigest(CdBTjzxguG3M1MiSiLjPPGwEHcfXJBwnY38FYzDdmUCX), TransactionDigest(DUoiNN7ZZsYhrCLGtppC6eV2FFqBKkuqA7kSRwcg9A6C), TransactionDigest(H59VaeC8FazAj3CBgyP8fF83ucGtQB35hTjHN3ohUZiX)] not present within 3630s.
2024-08-22T00:58:16.977564Z WARN handle_execution_effects{seq=46528740 epoch=487}: sui_core::checkpoints::checkpoint_executor: Transaction TransactionDigest(BWVK834A5L8455uSWXhMfr4BtHAvqCmjr5fzN8JV6Qfn) has missing input objects [VersionedObject { id: 0x0000000000000000000000000000000000000000000000000000000000000006, version: SequenceNumber(61149607) }]
How are you able to sync a node from ~30 epochs ago? Is it possible to start a new sync from a specific point in history, instead of checkpoint 0? I've already started a new sync and it's currently 9 months delayed... worst case it would take a month to complete, which is a major inconvenience.
What would help me most is to run a separate node starting from whichever epoch was ~20 days ago, then just sync to the latest checkpoint. That way we can resume our indexing process, since it's only 16 days behind (from when both nodes suffered the datacenter power loss & corruption).
How are you able to sync a node from ~30 epochs ago? Is it possible to start a new sync from a specific point in history, instead of checkpoint 0?
By starting the node from a database snapshot. We do have an unpruned db snapshot available if you'd like to restore from that; the bucket is public, with requester-pays enabled. You can download it via the aws cli, e.g.:

aws s3 cp --recursive --request-payer requester s3://mysten-mainnet-obj-snapshots/epoch_501/ /sui/db/live

Just a warning that the current size of an unpruned snapshot is almost 7TB.
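A sketch of the full restore sequence under the Docker compose setup referenced later in this thread; stopping the node first and clearing the stale db directory are assumptions based on general snapshot-restore practice rather than an official procedure, and the path and epoch are examples:

# Stop the fullnode before touching its database
docker compose down
# Clear the corrupted/stale live db (path per the compose volume mapping)
rm -rf /sui/db/live
# Pull the unpruned snapshot (requester pays; ~7TB, so expect a long transfer)
aws s3 cp --recursive --request-payer requester \
  s3://mysten-mainnet-obj-snapshots/epoch_501/ /sui/db/live
# Restart and let the node sync forward from that epoch
docker compose up -d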
Any help with the second node (fullnode-02-fullnode-1)? I fear I'll be shit-outta-luck if there's any future power loss, despite utilizing the archival fallback. I still don't understand why an interruption of syncing would completely corrupt these Sui nodes, whereas my EVM / Cosmos SDK / Solana nodes had no issue.
Also I'm unsure if enabling db-snapshots within AWS would be economical on our end.
Hi team,
My RPC node service LedgerQL has experienced two unplanned power losses this past year on servers running Sui (archive / non-pruned) mainnet nodes, and upon server reboot these Sui nodes have been unable to re-sync. I'm opening this issue not only for a resolution to recover both of our nodes, but also to address concerns with how Sui handles unplanned service interruptions, as we've not faced this with any of our EVM, Cosmos SDK, Solana, and alternative MoveVM (Aptos) nodes. These Sui mainnet nodes are also roughly 7TB in size, so a full snapshot recovery is incredibly taxing from a resource standpoint.
The 1st occurrence was February 29th 2024, from the below Discord link: https://discord.com/channels/916379725201563759/968392942517649438/1212535832045551736
After the server rebooted, one of our Sui mainnet nodes produced the below logs and failed to sync, while the other node synced without issue. We were unable to reach a resolution in the Discord conversation, so that particular Sui node was scrubbed and replaced with the secondary node that was syncing without issue.
Yesterday, after another power loss & recovery, neither of our nodes has been able to recover. The primary node returns the below, erroring & exiting the application after a few seconds:
The second Sui node returns the below, without syncing any new checkpoints after running for several hours:
Both nodes are running on v1.30.1, mysten/sui-node:a4185da5659d8d299d34e1bb2515ff1f7e32a20a, using the Docker compose from https://github.com/MystenLabs/sui/tree/main/docker/fullnode.
Below are the contents of fullnode-template.yml: