filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Splitstore: Defaults are not good #10699

Open RobQuistNL opened 1 year ago

RobQuistNL commented 1 year ago

Checklist

Lotus component

Lotus Version

1.21.0-rc3

Repro Steps

  1. Run a node
  2. Do a pruned import
  3. Set HotStoreFullGCFrequency = 1 so that prunes happen as often as possible
  4. Observe that, even then, the prune never happens

Describe the Bug

After some investigation, I figured out that HotStoreMaxSpaceThreshold actually behaves as "The maximum size the current hotstore + the potential new copy can occupy on disk".

It does not behave as the docs state: "When HotStoreMaxSpaceTarget is set Moving GC will be triggered when total moving size exceeds HotstoreMaxSpaceTarget - HotstoreMaxSpaceThreshold".

Likewise, HotstoreMaxSpaceSafetyBuffer should be described as "the maximum size the new hotstore can be", instead of the current description: "Safety buffer to prevent moving GC from overflowing disk when HotStoreMaxSpaceTarget is set. Moving GC will not occur when total moving size exceeds HotstoreMaxSpaceTarget - HotstoreMaxSpaceSafetyBuffer".
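
To make the difference concrete, here is a minimal Go sketch of how I read the two behaviours. This is my own reconstruction from what I observe, not the actual splitstore code; the parameter names just mirror the config options:

```go
// Sketch only: my reconstruction of the two readings, not the actual
// splitstore code.
package splitstoreexample

// Documented behaviour: moving GC is triggered once the estimated moving
// size exceeds HotstoreMaxSpaceTarget - HotstoreMaxSpaceThreshold.
func shouldMoveDocumented(movingSize, target, threshold int64) bool {
	return movingSize > target-threshold
}

// Behaviour I observe: a moving GC only actually happens when the current
// hotstore plus the projected new copy still fits under the threshold,
// i.e. the threshold acts as a cap on "current hotstore + new copy".
func canMoveObserved(currentSize, newCopySize, threshold int64) bool {
	return currentSize+newCopySize < threshold
}
```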

Doc issues

The docs state these as defaults:

HotStoreMaxSpaceThreshold = 150000000000
HotstoreMaxSpaceSafetyBuffer = 50000000000

A node running without these values set will actually have these as defaults:

HotStoreMaxSpaceThreshold = 650000000000
HotstoreMaxSpaceSafetyBuffer = 50000000000

GC Hot CLI defaults

The docs state we should run lotus chain prune hot --periodic --threshold 0.00000001 and increase the number. The CLI default is 0.01, not 0.00000001.

Apart from that, it's never explained what this threshold is. I now know it's some magic badgerBS value, but I still have no idea what I'm actually setting when I change it.
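
My assumption (unverified) is that this threshold is handed more or less directly to Badger's value-log GC as a discard ratio, something along these lines:

```go
package splitstoreexample

import "github.com/dgraph-io/badger/v2"

// Assumption, not the actual Lotus code: if the --threshold flag is used as
// Badger's discard ratio, a value-log file is only rewritten when at least
// that fraction of it can be discarded. A lower threshold therefore lets
// more files qualify, i.e. GC becomes more aggressive.
func onlineGC(db *badger.DB, threshold float64) error {
	for {
		err := db.RunValueLogGC(threshold)
		if err == badger.ErrNoRewrite {
			return nil // nothing left above the discard ratio
		}
		if err != nil {
			return err
		}
		// a value-log file was rewritten; try again
	}
}
```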

Default pruned chain examples

When running a node with a pruned chain and HotStoreFullGCFrequency = 1, the first time I see a GC run we get the logs shown below. This means the defaults make no sense: a freshly pruned chain will always exceed 50000000000 (the new hotstore's expected size here is 245681686326).

It will also not trigger because the combined size exceeds the limit: new 245681686326 + current 448854471748 >= 650000000000.
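
Just plugging the numbers from the logs below into those conditions (a quick sanity check, not Lotus code):

```go
package main

import "fmt"

func main() {
	// Numbers taken from the log output below.
	const (
		currentHot   int64 = 448_854_471_748 // "measured hot store size"
		newCopy      int64 = 245_681_686_326 // "approximate new size"
		target       int64 = 650_000_000_000 // "target max" from the warning
		safetyBuffer int64 = 50_000_000_000  // HotstoreMaxSpaceSafetyBuffer
	)

	fmt.Println(newCopy > safetyBuffer)       // true: the new copy alone is ~5x the buffer
	fmt.Println(currentHot+newCopy >= target) // true: 694536158074 >= 650000000000
}
```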

Apart from these settings, it looks like the prune logic doesn't take disk space into account. I like that we can set our own thresholds, but in my case I just want 2 things:

In my opinion, by using a clearer set of configuration params, we could achieve a nice config setup;

OR

Then we should always know when we're coming close to a point of no return and have to GC.

Default config options could just trigger GC when the system notices we're about to run out of disk space (see the sketch below).
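
As a rough illustration of the idea only (a hypothetical helper, not a proposal for the actual implementation), the trigger could look at free space on the hotstore volume directly:

```go
//go:build linux

package splitstoreexample

import "golang.org/x/sys/unix"

// Hypothetical sketch: trigger a moving GC while there is still enough free
// space for the new copy but the remaining headroom is getting thin. Once
// free space no longer fits a copy, a moving GC is no longer possible and we
// are past the point of no return.
func shouldTriggerMovingGC(hotstorePath string, expectedCopySize, headroom uint64) (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs(hotstorePath, &st); err != nil {
		return false, err
	}
	free := st.Bavail * uint64(st.Bsize) // bytes available on the volume
	return free > expectedCopySize && free < expectedCopySize+headroom, nil
}
```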

Logging Information

```json
{"level":"warn","ts":"2023-04-19T15:34:02.290Z","logger":"splitstore","caller":"splitstore/splitstore_compact.go:255","msg":"missing object reference bafy2bzaceapqmsgwyjvurmxgti73xfpnbyakxgyua33yobgqkdgaieuyu6eyq in bafy2bzacec4ltib5nbeudklbcfqtteygv4hxnhjapqeighppqbny6txunwuyy"}
(... a bunch of these "missing object reference" messages ...) 
{"level":"info","ts":"2023-04-19T15:34:19.246Z","logger":"splitstore","caller":"splitstore/splitstore_compact.go:1358","msg":"purged cold objects","purged":36013396,"live":568}
{"level":"info","ts":"2023-04-19T15:34:19.246Z","logger":"splitstore","caller":"splitstore/splitstore_compact.go:814","msg":"purging cold objects from hotstore done","took":258.911799297}
{"level":"info","ts":"2023-04-19T15:34:19.246Z","logger":"splitstore","caller":"splitstore/splitstore_compact.go:950","msg":"ending critical section"}
{"level":"info","ts":"2023-04-19T15:34:19.246Z","logger":"splitstore","caller":"splitstore/splitstore_compact.go:816","msg":"critical section done","total protected size":46828899582,"total marked live size":617650}
{"level":"info","ts":"2023-04-19T15:34:19.247Z","logger":"splitstore","caller":"splitstore/splitstore_gc.go:48","msg":"measured hot store size: 448854471748, approximate new size: 245681686326, should do full true, can do full false"}
{"level":"warn","ts":"2023-04-19T15:34:19.247Z","logger":"splitstore","caller":"splitstore/splitstore_gc.go:54","msg":"Attention! Estimated moving GC size 245681686326 is not within safety buffer 50000000000 of target max 650000000000, performing aggressive online GC to attempt to bring hotstore size down safely"}
{"level":"warn","ts":"2023-04-19T15:34:19.247Z","logger":"splitstore","caller":"splitstore/splitstore_gc.go:55","msg":"If problem continues you can 1) temporarily allocate more disk space to hotstore and 2) reflect in HotstoreMaxSpaceTarget OR trigger manual move with `lotus chain prune hot-moving`"}
{"level":"warn","ts":"2023-04-19T15:34:19.247Z","logger":"splitstore","caller":"splitstore/splitstore_gc.go:56","msg":"If problem continues and you do not have any more disk space you can run continue to manually trigger online GC at aggressive thresholds (< 0.01) with `lotus chain prune hot`"}
{"level":"info","ts":"2023-04-19T15:34:19.247Z","logger":"splitstore","caller":"splitstore/splitstore_gc.go:72","msg":"garbage collecting blockstore"}
{"level":"info","ts":"2023-04-19T15:36:15.119Z","logger":"splitstore","caller":"splitstore/splitstore_gc.go:81","msg":"garbage collecting blockstore done","took":115.87249909}
{"level":"info","ts":"2023-04-19T15:36:15.119Z","logger":"splitstore","caller":"splitstore/splitstore_gc.go:64","msg":"measured hot store size after GC: 454373389774"}
{"level":"info","ts":"2023-04-19T15:36:16.534Z","logger":"splitstore","caller":"splitstore/splitstore_compact.go:160","msg":"compaction done","took":43545.729444289}
{"level":"info","ts":"2023-04-19T15:38:20.772Z","logger":"splitstore","caller":"splitstore/splitstore_compact.go:858","msg":"preparing compaction transaction"}
RobQuistNL commented 1 year ago

Another note: The logs state

If problem continues and you do not have any more disk space you can run continue to manually trigger online GC at aggressive thresholds (< 0.01) with lotus chain prune hot

This tells me that a lower value is more aggressive? The other docs tell me a higher value is more aggressive...