MystenLabs / sui

Sui, a next-generation smart contract platform with high throughput, low latency, and an asset-oriented programming model powered by the Move programming language
https://sui.io
Apache License 2.0

sui-node corrupted after running for two hours: Illegal instruction #18252

Open zhangxf55 opened 1 week ago

zhangxf55 commented 1 week ago

Steps to Reproduce Issue

  1. Download the binary from the GitHub repo:
    cd /root/sui/bin && \
    wget https://github.com/MystenLabs/sui/releases/download/mainnet-v1.26.2/sui-mainnet-v1.26.2-ubuntu-x86_64.tgz && \
    tar -zxvf sui-mainnet-v1.26.2-ubuntu-x86_64.tgz && \
    rm -rf sui-mainnet-v1.26.2-ubuntu-x86_64.tgz
  2. Download the genesis.blob file from the GitHub repo:
    wget -O /root/sui/conf/genesis.blob https://github.com/MystenLabs/sui-genesis/raw/main/mainnet/genesis.blob
  3. Edit the config.yaml file:
    
    # SUI Database Location
    db-path: "/data/sui"

    # For ipv4, update this to "/ipv4/X.X.X.X/tcp/8080/http"
    network-address: "/dns/127.0.0.1/tcp/8080/http"
    metrics-address: "127.0.0.1:9184"
    # this address is also used for web socket connections
    json-rpc-address: "127.0.0.1:9000"
    enable-event-processing: true

    genesis:
      # Update this to the location of where the genesis file is stored
      genesis-file-location: "/root/sui/conf/genesis.blob"

    authority-store-pruning-config:
      num-latest-epoch-dbs-to-retain: 3
      epoch-db-pruning-period-secs: 3600
      num-epochs-to-retain: 1
      max-checkpoints-in-batch: 10
      max-transactions-in-batch: 1000
      pruning-run-delay-seconds: 60

    # add p2p bootnodes, reference: https://docs.sui.io/guides/operator/sui-full-node
    p2p-config:
      seed-peers:

zhangxf55 commented 1 week ago

I got the binary from 3 different places:

  1. AWS CDN (I got the link from Discord): https://sui-releases.s3-accelerate.amazonaws.com/f531168c745260b60a4ec4906c9f2b22240d872d/sui-node
  2. The official repo on GitHub: https://github.com/MystenLabs/sui/releases/tag/mainnet-v1.26.2
  3. Compiled from source myself:
    git checkout f531168c745260b60a4ec4906c9f2b22240d872d && cargo build --release

All three files have different md5 hashes but the same release version. I used each of them to run a full node, but none succeeded. What mistake did I make to cause this corruption? I have read the official documentation and existing issues but found no clue.

johnjmartin commented 23 hours ago

All three files have different md5 hashes but the same release version. I used each of them to run a full node, but none succeeded. What mistake did I make to cause this corruption? I have read the official documentation and existing issues but found no clue.

Our Rust builds are not deterministic, so a different md5 hash for each binary is expected.

zhangxf55 commented 3 hours ago

I have downloaded the snapshot:

~/sui/bin/sui-tool download-db-snapshot --latest \
  --network="mainnet" \
  --path="/data/sui" \
  --no-sign-request

and started sui-node with the command:

/root/sui/bin/sui-node --config-path /root/sui/conf/config.yaml

The resulting log output:

2024-06-26T14:13:56.432916Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171972
2024-06-26T14:13:56.432944Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171973
2024-06-26T14:13:56.432974Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171974
2024-06-26T14:13:56.433003Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171975
2024-06-26T14:13:56.433034Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171976
2024-06-26T14:13:56.433062Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171977
2024-06-26T14:13:56.433092Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171978
2024-06-26T14:13:56.433126Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171979
2024-06-26T14:14:05.815079Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171980
2024-06-26T14:14:06.110739Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171981
2024-06-26T14:14:06.110810Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171982
2024-06-26T14:14:07.501641Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171983
2024-06-26T14:14:07.501710Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171984
2024-06-26T14:14:07.501740Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171985
2024-06-26T14:14:07.501767Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171986
2024-06-26T14:14:07.501794Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171987
2024-06-26T14:14:07.501822Z  INFO sui_core::checkpoints::checkpoint_executor: Received checkpoint summary from state sync sequence_number=37171988

Does this mean that the node is syncing and functioning normally?
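
A rough way to read a log like the one above is to pull out the `sequence_number` values and check that they increase without gaps. The helper below is only a sketch: the function name `summarize_checkpoint_log` is hypothetical, and it assumes the log line format shown in this thread. It says nothing about whether checkpoints are being executed, only that summaries are arriving in order.

```shell
# Hypothetical helper: read sui-node log lines on stdin and summarize the
# sequence_number values found in them (assumes the format shown above).
summarize_checkpoint_log() {
  # Keep only the numeric part of each "sequence_number=<n>" token.
  grep -o 'sequence_number=[0-9]*' | cut -d= -f2 | awk '
    NR == 1 { first = $1 }
    # Count every place where a value is not exactly previous + 1.
    NR > 1 && $1 != prev + 1 { gaps++ }
    { prev = $1 }
    END {
      if (NR) printf "first=%d last=%d count=%d gaps=%d\n", first, prev, NR, gaps
    }
  '
}
```

For example, piping a saved log file into `summarize_checkpoint_log` prints one summary line; a nonzero `gaps` value would suggest that some summaries are missing or out of order in the captured excerpt.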

mystenmark commented 1 hour ago

That looks like everything is good now. The definitive check is to verify that the last executed checkpoint metric is increasing over time, which you can check by running:

curl -s http://localhost:9184/metrics | grep -w '^last_executed_checkpoint'

or by using a monitoring system (e.g. Grafana).
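
The "increasing over time" check can be sketched as two samples of the metric taken some seconds apart. This is an illustration, not an official tool: the function names, the `METRICS_URL` variable, and the 30-second default interval are assumptions, and the parser assumes the metric is exposed without labels, as the `grep` command above implies.

```shell
# Assumed default; matches the metrics-address used in this thread.
METRICS_URL="${METRICS_URL:-http://localhost:9184/metrics}"

# Read Prometheus text on stdin and print the last_executed_checkpoint value.
# Assumes the metric line has no labels: "last_executed_checkpoint <value>".
extract_checkpoint() {
  awk '$1 == "last_executed_checkpoint" { print $2; exit }'
}

# Succeed only if both samples are present and the second is strictly larger.
is_advancing() {
  before="$1"
  after="$2"
  [ -n "$before" ] && [ -n "$after" ] && [ "$after" -gt "$before" ] 2>/dev/null
}

# Sample the metric twice, INTERVAL seconds apart (hypothetical default: 30).
check_sync_progress() {
  interval="${1:-30}"
  first_sample="$(curl -s "$METRICS_URL" | extract_checkpoint)"
  sleep "$interval"
  second_sample="$(curl -s "$METRICS_URL" | extract_checkpoint)"
  if is_advancing "$first_sample" "$second_sample"; then
    echo "advancing: $first_sample -> $second_sample"
  else
    echo "not advancing: $first_sample -> $second_sample"
    return 1
  fi
}
```

Running `check_sync_progress 60` on the node host takes two samples a minute apart and returns a nonzero status if the value did not grow, which makes it easy to wire into a cron job or health check.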