filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

[Sync stuck] One of lotus fvm gets stuck sometimes #11536

Open OneilYang opened 9 months ago

OneilYang commented 9 months ago

Lotus Version

lotus version 1.25.1

Repro Steps

  1. Start the lotus daemon.
  2. Check "lotus sync status".
  3. After the daemon has been running for a long time, the output sometimes shows: Worker xxx: (first one) Stage: header sync Elapsed: xx Hour yy M ... <-- stuck here for a very long time

At this point, "lotus sync wait" only displays this worker's output, but chain sync itself seems fine ("lotus-miner info" reports sync as ok).

If splitstore is in use, compaction checks "sync wait", and lotus then gets stuck entirely.

Describe the Bug

After the lotus daemon has been running for a long time, "lotus sync status" sometimes shows the first worker stuck in the header sync stage, with its elapsed time growing for hours:

Worker xxx: (first one) Stage: header sync Elapsed: xx Hour yy M ...

While this is happening, "lotus sync wait" only displays the stuck worker's output, yet chain sync itself seems fine ("lotus-miner info" reports sync as ok).

If splitstore is in use, compaction checks "sync wait", and lotus then gets stuck entirely.

Logging Information

Worker xxx: (first one)
        Stage: header sync
        Elapsed: xx Hour yy M ...  <-- stuck here very long time

OneilYang commented 9 months ago

Is there any timeout check on the lotus worker status?

I think a timeout check is necessary; it would protect the application in many situations, including this one.

OneilYang commented 7 months ago

Hi, the stuck "worker" is actually a thread of the lotus daemon, not the lotus-worker process, so I think the label should be "area/lotus..." or something else related to lotus. Thanks.

OneilYang commented 7 months ago

Also, it seems I can reproduce it by running for a long time on many machines. If you want to debug it, please let me know.

OneilYang commented 7 months ago

I found that many users have this issue, and it seems they don't know how it happens: one fvm gets stuck --> sync still works until splitstore starts --> splitstore checks the sync status, so lotus hangs. They just feel that splitstore is not stable enough, but it is actually a sync problem.

How to resolve it:

  1. Find the root cause, but that does not seem easy (a libp2p network issue, I guess?).
  2. Add timeout protection to the fvm (for example, 1-2 minutes); it would avoid this issue and other similar issues in the future.
  3. Have splitstore check the sync status in some other way (sidestepping the problem).

Thanks~