MystenLabs / sui

Sui, a next-generation smart contract platform with high throughput, low latency, and an asset-oriented programming model powered by the Move programming language
https://sui.io
Apache License 2.0

[Sui tools] Unable to download formal snapshots #19386

Closed Sceat closed 2 weeks ago

Sceat commented 3 weeks ago

Steps to Reproduce Issue

/data/tools/sui-tool download-formal-snapshot --network testnet --path /data/suidb --genesis /data/genesis.blob --num-parallel-downloads 25 --no-sign-request --latest

Expected Result

It should work properly

Actual Result

It ends up failing with an error like:

Failed to get obj file epoch_491/1_14.obj after 6 attempts

Trying multiple recent epochs doesn't help.

Full script

```yaml
- name: init-sui-tools
  image: ubuntu:latest
  command: ["/bin/sh", "-c"]
  args:
    - |
      echo "Starting init container to set up Sui tools and download snapshots..."

      if [ ! -d "/data/tools" ]; then
        echo "Sui tools not found. Installing necessary packages and downloading Sui tools..."
        mkdir -p /data/tools && cd /data/tools
        apt-get update
        apt-get install -y wget coreutils libpq-dev curl expect
        wget https://github.com/MystenLabs/sui/releases/download/{{ .Values.image.tag }}/sui-{{ .Values.image.tag }}-ubuntu-x86_64.tgz
        tar -xzvf sui-{{ .Values.image.tag }}-ubuntu-x86_64.tgz
        echo "Sui tools downloaded and extracted."
      else
        echo "Sui tools already exist, skipping download."
      fi

      if [ ! -f "/data/genesis.blob" ]; then
        cd /data
        echo "Downloading genesis blob..."
        curl -fLJO https://github.com/MystenLabs/sui-genesis/raw/main/{{ .Values.network }}/genesis.blob
      else
        echo "Genesis blob already exists, skipping download."
      fi

      download_snapshot() {
        echo "Downloading formal snapshot..."
        mkdir -p /data/suidb
        unbuffer /data/tools/sui-tool download-formal-snapshot --network {{ .Values.network }} --path /data/suidb --genesis /data/genesis.blob --num-parallel-downloads 25 --no-sign-request --epoch {{ .Values.config_sidecar.SNAPSHOT_EPOCH }}
        return $?
      }

      while true; do
        if [ ! -d "/data/suidb" ] || [ -z "$(ls -A /data/suidb)" ]; then
          if download_snapshot; then
            echo "Snapshot download successful."
            break
          else
            EXIT_STATUS=$?
            echo "Snapshot download failed. Exit status: $EXIT_STATUS"
            if [ -d "/data/suidb" ]; then
              echo "Removing failed suidb folder..."
              rm -rf /data/suidb
            fi
            echo "Retrying in 60 seconds..."
            sleep 60
          fi
        else
          echo "Formal snapshot already exists, skipping download."
          break
        fi
      done

      echo "Init container setup complete."
```

johnjmartin commented 3 weeks ago

@Sceat thanks for the report. We're looking into the problem. In the meantime can you try using a db snapshot? Something like: /data/tools/sui-tool download-db-snapshot --network testnet --path /data/suidb --num-parallel-downloads 25 --no-sign-request --latest --skip-indexes should work

Sceat commented 3 weeks ago

seems 1TO isn't enough for the db-snapshot :/ any other workaround ?

johnjmartin commented 3 weeks ago

> seems 1TO isn't enough for the db-snapshot :/ any other workaround ?

I'll assume you meant 1TB of disk. With --skip-indexes it should be just enough:

du -sh /opt/sui/db/authorities_db/full_node_db/live/*
75G   /opt/sui/db/authorities_db/full_node_db/live/checkpoints
38M   /opt/sui/db/authorities_db/full_node_db/live/epochs
20M   /opt/sui/db/authorities_db/full_node_db/live/fullnode_pending_transactions
2.2T  /opt/sui/db/authorities_db/full_node_db/live/indexes
32G   /opt/sui/db/authorities_db/full_node_db/live/rest_index
908G  /opt/sui/db/authorities_db/full_node_db/live/store

The stated fullnode hardware requirements are 4TB of disk: https://docs.sui.io/guides/operator/sui-full-node#hardware-requirements.
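
To make the "just enough" estimate concrete, here is a minimal sketch that sums the `du -sh` figures above while skipping the `indexes` directory. The function name `gib_without_indexes` is made up for illustration, and it assumes `du -h` style suffixes (K, M, G, T) in powers of 1024:

```shell
#!/bin/sh
# Hypothetical helper (not part of sui-tool): read `du -sh` style lines on
# stdin and print the total in GiB, ignoring the `indexes` directory.
gib_without_indexes() {
  awk '
    $2 ~ /\/indexes$/ { next }            # skipped when using --skip-indexes
    {
      n = $1 + 0                          # numeric prefix, e.g. 908 from "908G"
      u = substr($1, length($1))          # unit suffix
      if (u == "T")      n *= 1024
      else if (u == "M") n /= 1024
      else if (u == "K") n /= 1048576
      total += n
    }
    END { printf "%.1f\n", total }
  '
}

# Example: du -sh /opt/sui/db/authorities_db/full_node_db/live/* | gib_without_indexes
```

With the numbers quoted above this comes to roughly 1015 GiB, which fits on a 1 TiB (1024 GiB) volume with very little headroom, and would not fit if the disk is 1 TB in decimal units.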

Sceat commented 2 weeks ago

Regarding the initial issue, I ran it manually and here are the logs:

testnet

root@sui-node-testnet-0:/data# /data/tools/sui-tool download-formal-snapshot --network testnet --path /data/suidb --genesis /data/genesis.blob --num-parallel-downloads 25 --no-sign-request --latest
Beginning formal snapshot restore to end of epoch 497, network: Testnet, verification mode: Normal
[00:09:44] ████████████████████████████████████████ 1187 out of 1187 .ref files done (ref files download complete)
[00:00:52] ████████████████████████████████████████ 498/498 (Checkpoint summary sync is complete)
[00:00:02] ████████████████████████████████████████ 498/498 (Checkpoint summary verification is complete)
[00:10:13] ████████████████████████████████████████ 1187 out of 1187 ref files checksummed (Checksumming complete)
[00:10:13] ████████████████████████████████████████ 1187 out of 1187 ref files checksummed (Checksumming complete)
[00:52:54] ████████████████████████████░░░░░░░░░░░░ 820 out of 1187 ref files accumulated from snapshot (file partitions per sec: 0.2583451628323909)
thread 'tokio-runtime-worker' panicked at /home/runner/work/sui/sui/crates/sui-snapshot/src/reader.rs:391:45:
Failed to get obj file epoch_497/1_12.obj after 6 attempts
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)

mainnet

root@sui-node-testnet-0:/data# /data/tools/sui-tool download-formal-snapshot --network mainnet --path /data/suidb --genesis /data/genesis.blob --num-parallel-downloads 25 --no-sign-request --latest
Beginning formal snapshot restore to end of epoch 525, network: Mainnet, verification mode: Normal
[00:02:49] ████████████████████████████████████████ 578 out of 578 .ref files done (ref files download complete)
[00:27:50] ███████████████████████████░░░░░░░░░░░░░ 359/526 (checkpoints synced per sec: 0.21427051928046761)
[00:03:12] ████████████████████████████████████████ 578 out of 578 ref files checksummed (Checksumming complete)
[00:21:52] ████████████████████████████████████████ 578 out of 578 ref files accumulated from snapshot (Accumulation complete)
[00:21:50] ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 1 out of 578 .obj files done (Download speed: 0.0552763508404269 MiB/s)
thread 'main' panicked at /home/runner/work/sui/sui/crates/sui-tool/src/lib.rs:908:10:
Summaries task failed: failed to get

Caused by:
    0: error sending request for url (https://s3.us-west-2.amazonaws.com/mysten-mainnet-archives/epoch_358/30667378.sum)
    1: client error (SendRequest)
    2: channel closed
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Aborted (core dumped)
Sceat commented 2 weeks ago

I got it to work on another host, with both the 1.28.3 and 1.33.2 versions of sui-tool (still on testnet).

Working host

Failing host

Note that the x86_64 host could also fail in the past: the exact same setup worked on Vultr (while still failing about 60% of the time), and switching to AWS now makes it fail 100% of the time. The problem seems to occur only when running in Docker/Kubernetes.

SPEC

Architecture: aarch64
  CPU op-mode(s): 32-bit, 64-bit
  Byte Order: Little Endian
CPU(s): 8
  On-line CPU(s) list: 0-7
Vendor ID: ARM
  Model name: Neoverse-N1
    Model: 1
    Thread(s) per core: 1
    Core(s) per cluster: 8
    Socket(s): -
    Cluster(s): 1
    Stepping: r3p1
    BogoMIPS: 243.75
    Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Caches (sum of all):
  L1d: 512 KiB (8 instances)
  L1i: 512 KiB (8 instances)
  L2: 8 MiB (8 instances)
  L3: 32 MiB (1 instance)
NUMA:
  NUMA node(s): 1
  NUMA node0 CPU(s): 0-7
Vulnerabilities:
  Gather data sampling: Not affected
  Itlb multihit: Not affected
  L1tf: Not affected
  Mds: Not affected
  Meltdown: Not affected
  Mmio stale data: Not affected
  Reg file data sampling: Not affected
  Retbleed: Not affected
  Spec rstack overflow: Not affected
  Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1: Mitigation; __user pointer sanitization
  Spectre v2: Mitigation; CSV2, BHB
  Srbds: Not affected
  Tsx async abort: Not affected

Sceat commented 2 weeks ago

I finally got it to work by using --num-parallel-downloads 3. It was that simple...
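
For anyone scripting this in an init container, a rough sketch of a retry loop that steps the parallelism down instead of retrying at 25 forever might look like the following. The function name `download_with_backoff` and the `SUI_TOOL`, `NETWORK`, `DB_PATH`, and `GENESIS` variables are made up for illustration. The sui-tool flags are the ones already used in this thread, and the intermediate value of 10 is an arbitrary choice:

```shell
#!/bin/sh
# Hypothetical wrapper (not part of sui-tool): retry the formal snapshot
# download with progressively lower parallelism.
# These variables are illustrative defaults taken from the paths in this thread.
SUI_TOOL="${SUI_TOOL:-/data/tools/sui-tool}"
NETWORK="${NETWORK:-testnet}"
DB_PATH="${DB_PATH:-/data/suidb}"
GENESIS="${GENESIS:-/data/genesis.blob}"

download_with_backoff() {
  # Each argument is a parallelism level to try, in order.
  for parallel in "$@"; do
    echo "Trying --num-parallel-downloads $parallel"
    if "$SUI_TOOL" download-formal-snapshot \
        --network "$NETWORK" \
        --path "$DB_PATH" \
        --genesis "$GENESIS" \
        --num-parallel-downloads "$parallel" \
        --no-sign-request --latest; then
      echo "Snapshot download succeeded with $parallel parallel downloads."
      return 0
    fi
    echo "Download failed with $parallel parallel downloads; removing partial data."
    # ${DB_PATH:?} aborts instead of expanding to an empty path.
    rm -rf -- "${DB_PATH:?}"
  done
  return 1
}

# Example: download_with_backoff 25 10 3
```

Calling `download_with_backoff 25 10 3` tries the original setting first and only falls back to 3 if the higher values fail, so hosts where 25 works are not slowed down.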