filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

[Sealing Issue] Sectors stuck in PreCommitWait with all transactions executed #6131

Open utgarda opened 3 years ago

utgarda commented 3 years ago

I understand this probably wasn't a normal workflow. Please advise whether I can restore the 3 sectors or whether they should be removed.

Describe the problem

The Lotus daemon was restarted and spent some time syncing while the miner kept running. I had to remove .lotus/kvlog before I could get the daemon running again.

3 sectors stay in the PreCommitWait state indefinitely; the last lines of their sector logs look like this:

5.      2021-04-25 09:49:02 +0300 MSK:  [event;sealing.SectorPreCommitted]      {"User":{"Message":{"/":"bafy2bzacebghhky4vuqdhvfnfv7de3jztioetxxw3s4tg4vcab37ry62x53uu"},"PreCommitDeposit":"49621133101988814","PreCommitInfo":{"SealProof":8,"SectorNumber":20,"SealedCID":{"/":"bagboea4b5abcbw7mdd2cjfeofi2lr35iuvpz5eu4gmqcitewqa3dvwpn3v7tlt3s"},"SealRandEpoch":698901,"DealIDs":[],"Expiration":2253883,"ReplaceCapacity":false,"ReplaceSectorDeadline":0,"ReplaceSectorPartition":0,"ReplaceSectorNumber":0}}}
6.      2021-04-25 16:45:15 +0300 MSK:  [event;sealing.SectorChainPreCommitFailed]      {"User":{}}
        handler: websocket connection closed
7.      2021-04-25 23:52:44 +0300 MSK:  [event;sealing.SectorRetryPreCommitWait]        {"User":{}}

No local pending transactions.
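
For anyone triaging a similar case, the SectorPreCommitted event above already carries what is needed to look the pre-commit up on chain: the message CID, the sector number, and the deposit. The following is a minimal sketch, not part of Lotus, that pulls those fields out of the JSON payload copied from log entry 5. The function name `summarize_precommit` and the shape of the returned dict are hypothetical; the only assumptions about Lotus are the field names visible in the log line and that the deposit is denominated in attoFIL (1 FIL = 10^18 attoFIL).

```python
import json
from decimal import Decimal

# 1 FIL = 10**18 attoFIL; the deposit in the event is assumed to be in attoFIL.
ATTOFIL_PER_FIL = Decimal(10) ** 18

# Payload of the sealing.SectorPreCommitted event, copied from log entry 5.
PAYLOAD = (
    '{"User":{"Message":{"/":"bafy2bzacebghhky4vuqdhvfnfv7de3jztioetxxw3s4tg4vcab37ry62x53uu"},'
    '"PreCommitDeposit":"49621133101988814",'
    '"PreCommitInfo":{"SealProof":8,"SectorNumber":20,'
    '"SealedCID":{"/":"bagboea4b5abcbw7mdd2cjfeofi2lr35iuvpz5eu4gmqcitewqa3dvwpn3v7tlt3s"},'
    '"SealRandEpoch":698901,"DealIDs":[],"Expiration":2253883,'
    '"ReplaceCapacity":false,"ReplaceSectorDeadline":0,'
    '"ReplaceSectorPartition":0,"ReplaceSectorNumber":0}}}'
)


def summarize_precommit(payload: str) -> dict:
    """Extract the fields needed to look up a pre-commit on chain (hypothetical helper)."""
    user = json.loads(payload)["User"]
    info = user["PreCommitInfo"]
    return {
        "sector": info["SectorNumber"],
        "message_cid": user["Message"]["/"],
        # Convert the string-encoded attoFIL amount to FIL without float rounding.
        "deposit_fil": Decimal(user["PreCommitDeposit"]) / ATTOFIL_PER_FIL,
        "seal_rand_epoch": info["SealRandEpoch"],
        "expiration": info["Expiration"],
    }


if __name__ == "__main__":
    for key, value in summarize_precommit(PAYLOAD).items():
        print(f"{key}: {value}")
```

The extracted message CID can then be checked against the chain, for example with `lotus state search-msg <cid>`, to confirm whether the pre-commit message was actually executed.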

Version

The output of lotus --version.

VERSION:
   1.6.0+mainnet+git.3fc23a785

Setup

Your miner and daemon setup, including the hardware you use, your environment variable settings, how you run your miner and workers, whether you use a GPU, etc.

Everything on one box:

AMD EPYC 7452 (32x2.35 GHz)
H11SSL-i-B
8 × 64 GB DDR4 ECC Reg
16 TB HDD SATA, 2 × 480 GB SSD SATA, 3 × 3840 GB SSD NVMe
Tesla T4 16 GB GDDR6

Daemon service:

cat /etc/systemd/system/lotus-daemon.service 
[Unit]
Description=Lotus Daemon
After=network-online.target
Requires=network-online.target

[Service]
User=lotus
Group=root

Environment=GOLOG_FILE="/var/log/lotus/daemon.log"
Environment=GOLOG_LOG_FMT="json"
Environment=LOTUS_PATH="/fc/fast/.lotus"
Environment=LOTUS_MINER_PATH="/fc/fast/.lotusminer"
Environment=TMPDIR="/fc/fast/tmpdir"
Environment=BELLMAN_CPU_UTILIZATION=0.875
Environment=FIL_PROOFS_MAXIMIZE_CACHING=1
Environment=FIL_PROOFS_USE_GPU_COLUMN_BUILDER=1
Environment=FIL_PROOFS_USE_GPU_TREE_BUILDER=1
Environment=FIL_PROOFS_USE_MULTICORE_SDR=1
Environment=FIL_PROOFS_PARAMETER_CACHE="/fc/fast/proofs/parameter_cache"
Environment=FIL_PROOFS_PARENT_CACHE=/fc/fast/proofs/parent_cache

Environment=LOTUS_BACKUP_BASE_PATH=/home/lotus/backup

#ExecStart=/usr/local/bin/lotus daemon --import-snapshot /fc/fast/2/minimal_finality_stateroots_701860_2021-04-25_15-00-00.car
ExecStart=/usr/local/bin/lotus daemon
Restart=always
RestartSec=10

MemoryAccounting=true
MemoryHigh=8G
MemoryMax=10G
LimitNOFILE=8192:10240

[Install]
WantedBy=multi-user.target

Miner service:

cat /etc/systemd/system/lotus-miner.service 
[Unit]
Description=Lotus Miner
After=network.target
After=lotus-daemon.service
Wants=lotus-daemon.service

[Service]
User=lotus
Group=root
ExecStart=/usr/local/bin/lotus-miner run --nosync
Environment=GOLOG_FILE="/var/log/lotus/miner.log"
Environment=GOLOG_LOG_FMT="json"
Environment=LOTUS_PATH="/fc/fast/.lotus"
Environment=LOTUS_MINER_PATH="/fc/fast/.lotusminer"
Environment=TMPDIR="/fc/fast/tmpdir"
Environment=BELLMAN_CPU_UTILIZATION=0.875
Environment=FIL_PROOFS_MAXIMIZE_CACHING=1
Environment=FIL_PROOFS_USE_GPU_COLUMN_BUILDER=1
Environment=FIL_PROOFS_USE_GPU_TREE_BUILDER=1
Environment=FIL_PROOFS_USE_MULTICORE_SDR=1
Environment=FIL_PROOFS_PARAMETER_CACHE="/fc/fast/proofs/parameter_cache"
Environment=FIL_PROOFS_PARENT_CACHE=/fc/fast/proofs/parent_cache

[Install]
WantedBy=multi-user.target

lotus-config.toml.gz lotusminer-config.toml.gz

Commands

lotus-storage-miner sectors pledge

Sectors status

The output of lotus-miner sectors status --log <sectorId> for the failed sector(s).

sector_20.log.gz sector_21.log.gz sector_22.log.gz lotus-config.toml.gz

Lotus miner logs

miner.log.gz

Lotus miner diagnostic info

Code modifications

No modifications

jennijuju commented 3 years ago

@utgarda first things first: you need to update your node to v1.8.0 or later soon for the upgrade to network v12, or you will lose sync with the network. See https://github.com/filecoin-project/lotus/discussions/6084#discussioncomment-663039

utgarda commented 3 years ago

@jennijuju thanks for the heads-up! I updated Lotus, but the problem above persists. Should I just remove the sectors, or try to recover them?