filecoin-project / go-state-types

Primitive and low level types used in chain state and actor method parameters

Optimize base migrations #285

Open rjan90 opened 6 days ago

rjan90 commented 6 days ago

When benchmarking the nv23 migration on my mainnet node, I get these bench-times:

Migration without cache is taking 41 seconds:

migration height  4056425
old cid  bafy2bzaceafswlgag4nowendbxzwxvvt4ofn4vyp75kbmdods4x2vbsepgt5q
new cid  bafy2bzacecparpmzboywlcpuqz5vp2sn47lk4d3bc24oksvynlbfs4v7vngwy
completed round actual (without cache), took  41.730056334s
manifest cid: bafy2bzacecbueuzsropvqawsri27owo7isa5gp2qtluhrfsto2qg7wpgxnkba

Migration with cache is taking 36 seconds, which is above our threshold for migration times - a migration should take less than 1 epoch (30s):

completed round actual (with cache), took  36.853138672s
ZenGround0 commented 6 days ago

All right, here's a high-level summary of time spent. For a more precise understanding of the code corresponding to each log line, check out the timing branch here: https://github.com/filecoin-project/go-state-types/compare/master...spike/timing-migration

Timing without cache

./lotus-shed migrate-state --repo=/mnt/nvmeraid0/daemon/ 23 bafy2bzacedgsnt7hkqwisxxfcndalrf4claltpyel6k6hixqjyhikg53puq6u
2024-07-03T19:03:27.924Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to initialize migration: 461.255µs
2024-07-03T19:03:49.019Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to run migration: 21.095439755s
2024-07-03T19:03:49.019Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to run f090 migration: 44.545µs
2024-07-03T19:03:51.295Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to flush actorsOut: 2.275956678s
completed round actual (without cache), took  24.891692657s

Timing with cache

2024-07-03T19:03:52.941Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to initialize migration: 342.139µs
2024-07-03T19:04:24.281Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to run migration: 31.339779092s
2024-07-03T19:04:24.281Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to run f090 migration: 34.505µs
2024-07-03T19:04:26.604Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to flush actorsOut: 2.323670911s
completed premigration, took  35.326578544s
2024-07-03T19:04:28.152Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to initialize migration: 250.044µs
2024-07-03T19:04:49.162Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to run migration: 21.009924062s
2024-07-03T19:04:49.162Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to run f090 migration: 29.727µs
2024-07-03T19:04:51.468Z    WARN    fil-consensus   filcns/upgrades.go:2743 time to flush actorsOut: 2.305785299s
completed round actual (with cache), took  25.048051787s
ZenGround0 commented 6 days ago

It's clear the main issue is the baseline migration of code CIDs. It's unclear why this is slower than expected. Also, memory usage seems to be about half of what I'd expect (8 bytes for actor cid * ~3_000_000 actors ~24 GiB). Some hypotheses: 1) caching isn't as good as it should/could be; 2) we're hitting some internal inflection point in Go map performance vs. size, since there are more actors in the state tree.
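To poke at hypothesis (2), a micro-benchmark along these lines could compare lookup cost at roughly the old and new actor counts. Everything here (key shape, counts, names) is a hypothetical stand-in for illustration, not code from this repo:

package cachebench

import (
    "strconv"
    "testing"
)

// Hypothetical micro-benchmark for hypothesis (2): does Go map lookup cost
// change noticeably between ~2.5M keys (nv21-era actor count) and ~3.1M keys
// (today)? Keys are string stand-ins for CIDs.
func benchLookups(b *testing.B, n int) {
    m := make(map[string]string, n)
    keys := make([]string, n)
    for i := 0; i < n; i++ {
        k := "bafy2-" + strconv.Itoa(i)
        keys[i] = k
        m[k] = k
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _ = m[keys[i%n]]
    }
}

func BenchmarkLookup2_5M(b *testing.B) { benchLookups(b, 2_500_000) }
func BenchmarkLookup3_1M(b *testing.B) { benchLookups(b, 3_100_000) }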

At this point I'm suspicious we're not going to be able to make this significantly faster by tomorrow. But I'll keep looking and run a pprof trace.

ZenGround0 commented 6 days ago

nv23-migration-mainnet.zip

I'm not great at parsing pprof diagrams, but here are both the cached and uncached runs. I think I'll try to isolate out only the cached run.

ZenGround0 commented 6 days ago

From what I can see, it looks like Go GC is the expensive thing.

ZenGround0 commented 6 days ago

Watching htop I see a max memory of 15 GB, and it looks like the cache is taking 11 GB. This seems off by 2x, so I'll look into measuring cache size.

--edit-- Measured the cache size: it grows to > 3M entries, the same number as migration jobs, so nothing is going wrong there, though I wonder why the memory overhead is so low?

--edit 2-- I never figured this out. Maybe some CIDs are identity CIDs? Maybe some state head CIDs are the same and Go reuses them internally?

ZenGround0 commented 6 days ago

Analysis

Here is a reference to the testing time of the nv21 migration: https://filecoinproject.slack.com/archives/CP50PPW2X/p1693544595593359

This shows a migration time with cache of ~20s. @rjan90 shared ~37s with cache above on a lower hardware spec, but on a good hardware spec machine we see ~24s with cache for the nv23 migration.

Note that between the time of that dry run and now, the network has grown by ~640,000 actors, or about 25% from its ~2,500,000 actor size at the time. Under the assumptions that the ~20s nv21 migration with cache was 1) bottlenecked by cache reads and not by redoing actual migration work, and 2) run on good hardware (I remember it being run last year and I'm confident about this), our 24s migration-with-cache time lines up very well with a model of time increasing linearly with the number of actors: 20s × 1.25 ≈ 25s. This is the model we should expect, since the top-level migration traverses all actors. Therefore the slowdown is expected.

Solutions

Smarter use of GC

As shown in the pprof profile above, the CPU bottleneck for running this migration is all in GC and memory management. This seems suboptimal because the migration did not take up much memory (~11 GB for the cache, ~15 GB high-water mark). Node operators have a lot of memory, so if we could figure out how to trade more memory for less time, that would be ideal, i.e. turn off GC completely for data allocated from migration goroutines.

My initial experiments setting GOGC=400 and GOGC=off did not improve migration time, and I have reached the limits of my understanding of Go GC here. But in principle it appears we could engineer a more memory-intensive trade-off; a rough sketch of one option is below.
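For reference, here's a minimal sketch of what trading memory for time could look like in code, using the runtime/debug knobs rather than the GOGC env var. The wrapper name and the 48 GiB soft limit are assumptions for illustration, not anything from the timing branch:

package main

import (
    "fmt"
    "runtime/debug"
)

// runWithRelaxedGC disables the percent-based GC trigger for the duration of
// the migration and relies on a soft memory limit (Go 1.19+) as a backstop,
// then restores the previous settings. The 48 GiB limit is an assumed value
// for a well-provisioned node, not a measured one.
func runWithRelaxedGC(migrate func() error) error {
    oldGC := debug.SetGCPercent(-1)            // turn off GOGC-style triggering
    oldLimit := debug.SetMemoryLimit(48 << 30) // soft cap so we don't OOM the node
    defer func() {
        debug.SetGCPercent(oldGC)
        debug.SetMemoryLimit(oldLimit)
    }()
    return migrate()
}

func main() {
    // Stand-in for the real migration call.
    err := runWithRelaxedGC(func() error { return nil })
    fmt.Println("migration err:", err)
}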

Top level HAMT diff

To get sublinear we have to stop traversing every actor. Fortunately this is possible by caching nodes in the top-level HAMT and skipping re-migration of the subtrees with no changes (see the sketch below). I expect this to speed things up significantly in practice. However, it is a big departure from the migration model that we've used for around 4 years, so we'll want to proceed carefully. IMO we should not try to do this without significantly slipping the nv23 deadline.
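Very roughly, the diffing idea looks like the following sketch. The node shape, field names, and migrateChanged are hypothetical simplifications (the real HAMT lives in go-hamt-ipld and is addressed by CIDs); the point is only that comparing child links lets us skip whole unchanged subtrees:

package main

import "fmt"

// Hypothetical, simplified HAMT node: children are keyed by slot and a string
// stands in for the child's CID.
type node struct {
    childCID map[int]string // link per occupied slot
    children map[int]*node  // loaded child nodes
    actors   map[string]int // leaf entries (address -> dummy actor state)
}

// migrateChanged walks prev and cur together and only descends into subtrees
// whose link changed; unchanged subtrees are skipped entirely, which is what
// makes the migration sublinear in the number of actors.
func migrateChanged(prev, cur *node, migrateActor func(addr string)) {
    if cur == nil {
        return
    }
    for slot, curLink := range cur.childCID {
        if prev != nil && prev.childCID[slot] == curLink {
            continue // identical link: the whole subtree is unchanged, skip it
        }
        var prevChild *node
        if prev != nil {
            prevChild = prev.children[slot]
        }
        migrateChanged(prevChild, cur.children[slot], migrateActor)
    }
    for addr, v := range cur.actors {
        if prev == nil || prev.actors[addr] != v {
            migrateActor(addr) // only re-migrate actors whose entry changed
        }
    }
}

func main() {
    oldTree := &node{actors: map[string]int{"f01234": 1, "f05678": 7}}
    newTree := &node{actors: map[string]int{"f01234": 1, "f05678": 8}}
    migrateChanged(oldTree, newTree, func(a string) { fmt.Println("re-migrating", a) })
}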

What should we do

The migration times are not ideal but probably safe to run on mainnet. Our two options are

ZenGround0 commented 6 days ago

FYI @Stebalien, I'm sure you have thoughts about both enabling more aggressive memory use and the HAMT diffing solution, and possibly other ideas not yet considered.

rvagg commented 2 days ago

@ZenGround0 can you explain why the migration is touching pretty much all the actors? I was running it with some debug printing and seeing it touch ~6M of them, and I'm not sure why. I understand that the actors' code CIDs have changed, but why are we going through apparently all of the on-chain actors here: https://github.com/filecoin-project/go-state-types/blob/master/migration/util.go#L212

ZenGround0 commented 15 hours ago

but why are we going through apparently all of the on-chain actors here: https://github.com/filecoin-project/go-state-types/blob/master/migration/util.go#L212

OK, nice, this is just code that should not be used for code CID migrations. The rundown is that this code was written long before FVM phase 1, when all actors we migrated needed state migrations.

I was running it with some debug printing and seeing it touch ~6M of them and I'm not sure why.

I think you are seeing 6M cache hits? If so, this makes sense given the unnecessary migration of all actor heads: 3M actors + 3M state heads.

This is a great find that I missed. It should give us 2x performance, which puts us at parity with Forest and makes the world make more sense. I'm going to work on a PR for this today in case it's not too late to get it in.
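Roughly, the shape of the fix is a code-CID-only migrator that never touches the state head. The types below are loose stand-ins for the migration package's shapes (the real ones live in go-state-types/migration and differ in detail), so treat this as a sketch rather than the PR itself:

package main

import "fmt"

// Loose stand-ins for the migration input/result shapes.
type migrationInput struct {
    Address string
    Head    string // state head CID (string stand-in)
}

type migrationResult struct {
    NewCodeCID string
    NewHead    string
}

// codeOnlyMigrator swaps the actor's code CID and passes the state head
// through untouched. Because the head never changes, there is nothing worth
// caching per actor, so it should not be wrapped in the caching migrator.
type codeOnlyMigrator struct {
    newCodeCID string
}

func (m codeOnlyMigrator) MigrateState(in migrationInput) (migrationResult, error) {
    return migrationResult{
        NewCodeCID: m.newCodeCID,
        NewHead:    in.Head, // unchanged: no state read, no cache lookup
    }, nil
}

func main() {
    m := codeOnlyMigrator{newCodeCID: "bafy2...newcode"}
    out, _ := m.MigrateState(migrationInput{Address: "f01234", Head: "bafy2...head"})
    fmt.Printf("%+v\n", out)
}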

ZenGround0 commented 15 hours ago

OK, furthermore, here's an interesting corollary -- we don't do any caching of top-level actors.

The cachedMigrator around the code CID migrator is purely bad: it costs us an unnecessary cache lookup on the state head AND it does no caching of the migrated actor.

Furthermore, I think the only way to correctly cache top-level actors is at the level of HAMT nodes, since actors are not stored on disk as single objects, so we don't save on disk writes by caching them individually. The correct way to do this is the HAMT diffing mentioned above. Anything else is more work for less payoff.

The only speedup that premigrations give us for the base migration is that the HAMT nodes are already written to disk, so the second time around badger doesn't need to actually write; it can just no-op when we give it a top-level actor HAMT node that didn't change.

ZenGround0 commented 15 hours ago

OK -- pretty modest gains, I don't know if it's worth re-releasing:

Before change:

completed premigration, took  34.522901313s
completed round actual (with cache), took  24.012397491s

After change:

completed premigration, took  29.261960129s
completed round actual (with cache), took  20.52595699s

A 15% improvement on the part that matters.

This makes sense -- we halved ~10s of time accessing Go maps, and there's still ~15s of time reading the HAMT from disk that we have to pay. This also makes me wonder how much better the actual migration would be in a lotus node with a hot actor HAMT cache from syncing the chain -- probably not a lot, since few actors actually change at the head.

BigLep commented 11 hours ago

@ZenGround0: will you put up a PR for this work so it's there for the future, even if it's not leveraged for nv23? (That's TBD, but I assume we'd include it if we're going to do an additional release anyway.)

ZenGround0 commented 11 hours ago

https://github.com/filecoin-project/go-state-types/pull/289