filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Fetch failure due to lack of disk space #4069

Open hyunmoon opened 3 years ago

hyunmoon commented 3 years ago

Describe the bug

WARN    rpc go-jsonrpc@v0.1.2-0.20200822201400-474f4fdccc52/handler.go:241  error in RPC call to 'Filecoin.Fetch': allocate local sector for fetching:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Remote).AcquireSector
        /home/downloads/lotus/extern/sector-storage/stores/remote.go:111
  - couldn't find a suitable path for a sector:
    github.com/filecoin-project/lotus/extern/sector-storage/stores.(*Local).AcquireSector
        /home/downloads/lotus/extern/sector-storage/stores/local.go:402

If a worker happens to get more sectors than it can handle, it eventually runs out of disk space. When that happens, all sectors that belong to the worker fail to finalize because they can't be fetched.
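
To illustrate the kind of guard that would avoid the over-assignment described above, here is a minimal sketch of a per-worker space reservation. It is not how Lotus schedules sectors; `SpaceBudget`, `Reserve`, `Release`, and `ErrNoSpace` are hypothetical names made up for this example, and the byte counts would have to come from the real sector size and scratch-space requirements.

```go
package main

import "errors"

// ErrNoSpace is returned when a reservation would exceed the budget.
// Hypothetical error value, not part of Lotus.
var ErrNoSpace = errors.New("not enough unreserved disk space")

// SpaceBudget tracks how many bytes of a worker's disk are already promised
// to sectors it has accepted. Hypothetical type for illustration only; it is
// not safe for concurrent use without external locking.
type SpaceBudget struct {
	Capacity uint64 // bytes usable for sector data
	reserved uint64 // bytes promised to in-flight sectors
}

// Reserve claims n bytes for a new sector. If they do not fit, it returns
// ErrNoSpace and leaves the budget unchanged.
func (b *SpaceBudget) Reserve(n uint64) error {
	if n > b.Capacity-b.reserved {
		return ErrNoSpace
	}
	b.reserved += n
	return nil
}

// Release returns n bytes to the budget, for example after a sector's files
// have been removed from the worker. Releasing more than is reserved simply
// resets the reservation to zero.
func (b *SpaceBudget) Release(n uint64) {
	if n > b.reserved {
		n = b.reserved
	}
	b.reserved -= n
}

// Free reports how many bytes are still unreserved.
func (b *SpaceBudget) Free() uint64 {
	return b.Capacity - b.reserved
}
```

With something like this, a worker would refuse a new sector when `Reserve` fails, instead of accepting it and failing later at fetch time.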

Usually when this happens, I can find some files that the worker should no longer have.

[Screenshots: report_fetch, report_fetch_2]

This time, I found 682G worth of cache files in the cache/fetching directory. The sectors they belong to are all in the Proving state, so these files should have been deleted.

To Reproduce

  1. A worker dedicated to PC2 takes new sectors while it still holds sectors in the WaitSeed, Committing, and Finalize states.
  2. Some sectors that are now in the Proving state were not properly deleted from the worker.
  3. The worker runs out of disk space.
  4. The sectors it holds can't be finalized.
  5. From that point on, the worker does nothing but output errors.

Expected behavior

  1. Any obsolete sector files should always be deleted from the worker.
  2. If the worker can't do anything due to lack of disk space, it should try to move some sectors to the miner or other workers to make some space.
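
As a rough sketch of the first point, the manual search for obsolete files could be narrowed down by matching file names against the sectors known to be in the Proving state. This assumes that sector files are named `s-t0<miner>-<number>` and that the worker serves a single miner, so the sector number alone identifies a sector. `sectorNumber` and `ObsoleteFiles` are hypothetical helpers, not Lotus APIs, and the result is only a list of candidates to review, not something to delete blindly.

```go
package main

import (
	"strconv"
	"strings"
)

// sectorNumber extracts the sector number from a file name of the assumed
// form "s-t0<miner>-<number>". ok is false for any name that does not match.
func sectorNumber(name string) (num uint64, ok bool) {
	const prefix = "s-t0"
	if !strings.HasPrefix(name, prefix) {
		return 0, false
	}
	parts := strings.Split(name[len(prefix):], "-")
	if len(parts) != 2 {
		return 0, false
	}
	// The miner ID must be numeric too, otherwise the name is not a sector file.
	if _, err := strconv.ParseUint(parts[0], 10, 64); err != nil {
		return 0, false
	}
	n, err := strconv.ParseUint(parts[1], 10, 64)
	if err != nil {
		return 0, false
	}
	return n, true
}

// ObsoleteFiles returns the names that look like sector files and whose
// sector number is marked true in proving. The caller is assumed to have
// built that set from the miner's own sector list.
func ObsoleteFiles(names []string, proving map[uint64]bool) []string {
	var out []string
	for _, name := range names {
		if num, ok := sectorNumber(name); ok && proving[num] {
			out = append(out, name)
		}
	}
	return out
}
```

The names would come from listing a directory such as cache/fetching on the worker; anything that does not parse as a sector file is left alone.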

Version (run lotus version): Tag v0.8.0

Additional context

I think this is happening because I simply don't have enough disk space on each worker. In theory it should be enough, but sometimes some workers get more sectors than others and run out of disk space.

Related to: https://github.com/filecoin-project/lotus/issues/3969 https://github.com/filecoin-project/lotus/issues/4015 https://filecoinproject.slack.com/archives/C0179RNEMU4/p1601262746147700

shotcollin commented 2 years ago

I'm also seeing this quite often. I spend a lot of time searching for obsolete sector files and manually deleting them. Is there still no resolution/workaround?