btcsuite / btcd

An alternative full node bitcoin implementation written in Go (golang)
https://github.com/btcsuite/btcd/blob/master/README.md
ISC License

blockchain, cmd, netsync, main: Add utxocache #1955

Closed kcalvinalvin closed 6 months ago

kcalvinalvin commented 1 year ago

Updated the PR; it is now very different from #1373.

Main differences are:

No utxoviewpoint middleman during headers-first sync

In #1373, the view is still used as a middleman. The flow for doing block validation would be:

1. Make a new UtxoViewPoint.
2. Load the relevant inputs into the UtxoViewPoint from the cache.
3. Connect transactions in the UtxoViewPoint
4. Apply the connections to the cache from the UtxoViewPoint
5. Flush the cache to the database when needed

The problems with this approach were: (2) is extremely expensive, as a lot of memory is allocated creating a whole new set of key/value pairs for the existing UTXOs, and (4) introduces a new code path from [view -> cache], which in turn introduces a whole new set of bugs.

The new code is different: the flow during headers-first sync (blocks up until the checkpoint) is:

1. Connect transactions in the cache
2. Flush the cache to the database when needed

This flow is much simpler and faster than the previous code.

For when the node is not in headers-first mode:

1. Make a new UtxoViewPoint
2. Load the relevant inputs into the UtxoViewPoint from the cache.
3. CheckConnectBlock with the UtxoViewPoint
4. Connect transactions in the cache
5. Flush the cache to the database when needed

This flow still has the overhead of (2), but it no longer has the new bugs introduced by applying the transaction connections from [view -> cache]. Instead, the view is only used as a temporary mutable map for checkConnectBlock, and the txs are connected in the cache just like in headers-first mode.
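As a rough sketch of the two flows, with placeholder names (utxoCache, connectTransactions, flushIfNeeded, and the helpers below do not reflect btcd's exact API):

```go
// Headers-first (blocks up to the checkpoint): no viewpoint middleman.
func connectBlockHeadersFirst(cache *utxoCache, block *btcutil.Block) error {
	// 1. Connect the block's transactions directly in the cache.
	if err := cache.connectTransactions(block); err != nil {
		return err
	}
	// 2. Flush the cache to the database when it grows past its limit.
	return cache.flushIfNeeded()
}

// Steady state: a temporary view is still built for full validation, but the
// connections are applied to the cache, never copied back from the view.
func connectBlockSteadyState(cache *utxoCache, block *btcutil.Block) error {
	view := NewUtxoViewpoint()
	// Load only the inputs the block spends from the cache into the view.
	if err := view.fetchInputUtxos(cache, block); err != nil {
		return err
	}
	// Full validation runs against the temporary view.
	if err := checkConnectBlock(block, view); err != nil {
		return err
	}
	// The cache is updated directly, same as in headers-first mode.
	if err := cache.connectTransactions(block); err != nil {
		return err
	}
	return cache.flushIfNeeded()
}
```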

No use of the cache during reorganizations

During reorganizations, a whole new possible bug was introduced: if the node is unexpectedly shut down, the disconnected txs may have been updated only in the cache and not in the database. The new code forces a flush before the reorganization happens and keeps the same code as before, eliminating the new class of bugs introduced in #1373.

This allowed the removal of TestUtxoCache_Reorg and all the associated (and duplicated) code.
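A minimal sketch of the flush-before-reorg rule; the wrapper and the cache's flush method here are illustrative, not the PR's actual code:

```go
// reorganizeWithFlush forces the cache to disk before any blocks are
// disconnected, so an unclean shutdown mid-reorg leaves the database in a
// state the existing (pre-cache) recovery code already understands.
func (b *BlockChain) reorganizeWithFlush(detachNodes, attachNodes *list.List) error {
	// Everything the cache knows is persisted first; the reorg itself then
	// runs through the same database-backed code path as before.
	if err := b.utxoCache.flush(); err != nil {
		return err
	}
	return b.reorganizeChain(detachNodes, attachNodes)
}
```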

No more utxoStateConsistency struct

It was enough to just save the block hash and not the single byte indicating if the cache is consistent or not. All the related code was removed.

Map slice instead of a singular map

A single map doubles in size when it gets full. For bigger maps, this may result in going over the promised memory limit. The mapslice allows us to allocate multiple maps with differing sizes, giving finer control over the allocated memory.

Another solution would be to force the user to choose from fixed amounts of memory for the utxo cache (1GB, 2GB, 4GB, 8GB, etc.). This would get rid of the code in sizehelper.go and the mapslice-related code in utxocache.go (the 491 lines this PR would otherwise add) but wouldn't give the user flexibility. I'm ok with either route.
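For illustration, here is a self-contained toy version of the mapslice idea described above (not the code in blockchain/utxocache.go; keys and values are simplified):

```go
package main

import "fmt"

// mapSlice spreads entries over several maps that are each created with a
// fixed size hint, so total memory can be tuned close to a budget instead of
// jumping in powers of two the way a single growing map does.
type mapSlice struct {
	maps     []map[string]int
	maxSizes []int // per-map entry limits, e.g. chosen by a size helper
}

func newMapSlice(maxSizes []int) *mapSlice {
	ms := &mapSlice{maxSizes: maxSizes}
	for _, n := range maxSizes {
		ms.maps = append(ms.maps, make(map[string]int, n))
	}
	return ms
}

// put updates an existing entry in place, or inserts into the first map that
// still has room. It reports false when every map is at its limit, which is
// the point where the real cache would flush to the database.
func (ms *mapSlice) put(k string, v int) bool {
	for _, m := range ms.maps {
		if _, ok := m[k]; ok {
			m[k] = v
			return true
		}
	}
	for i, m := range ms.maps {
		if len(m) < ms.maxSizes[i] {
			m[k] = v
			return true
		}
	}
	return false
}

func (ms *mapSlice) get(k string) (int, bool) {
	for _, m := range ms.maps {
		if v, ok := m[k]; ok {
			return v, true
		}
	}
	return 0, false
}

func main() {
	ms := newMapSlice([]int{2, 2}) // two tiny maps just for the demo
	ms.put("a", 1)
	ms.put("b", 2)
	ms.put("c", 3) // first map is full, lands in the second map
	fmt.Println(ms.get("c"))                    // 3 true
	fmt.Println(ms.put("d", 4), ms.put("e", 5)) // true false (now full)
}
```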

Known issue

During a flush, there's a large allocation of memory because database/ffldb holds onto every single bucket write to guarantee atomicity. This usually results in 2x the memory of what the cache is using: if the cache is using 2GB, ffldb will allocate 2GB as well, resulting in 4GB of memory being taken up by btcd. I don't think there's any easy way around this other than making a different code path for utxo cache flushes.

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 5152398437

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report. To ensure accuracy in future PRs, please see these guidelines. A quick fix for this PR: rebase it; your next report should be accurate.


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| config.go | 0 | 1 | 0.0% |
| blockchain/sizehelper.go | 84 | 87 | 96.55% |
| netsync/blocklogger.go | 0 | 4 | 0.0% |
| blockchain/utxoviewpoint.go | 31 | 36 | 86.11% |
| netsync/manager.go | 0 | 8 | 0.0% |
| server.go | 0 | 10 | 0.0% |
| blockchain/chainio.go | 27 | 38 | 71.05% |
| blockchain/chain.go | 43 | 58 | 74.14% |
| blockchain/utxocache.go | 290 | 394 | 73.6% |
| Total | 490 | 651 | 75.27% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| peer/peer.go | 7 | 73.2% |
| database/ffldb/db.go | 12 | 92.64% |
| netsync/manager.go | 18 | 0.0% |
| blockchain/chainio.go | 48 | 61.1% |
| blockchain/chain.go | 160 | 71.37% |
| Total | 245 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 5075515644 | 0.2% |
| Covered Lines | 27104 |
| Relevant Lines | 48817 |

💛 - Coveralls
Roasbeef commented 1 year ago

Giving this a shot for a mainnet sync; will report back if it gets to a higher height than referenced above (I think in the past I ran into an issue at that height too).

Running with the following config:

 GOGC=50 btcd --txindex --nopeerbloomfilters --maxpeers=8 --utxocachemaxsize=10000
Roasbeef commented 1 year ago

Had to restart my sync as I had some memory pressure limits incorrectly configured (though I'm surprised it was killed, given the box has 32 GB of RAM and was running with a 10 GB cache size).

I also think the current cache size estimate might be significantly undercounting the true overhead. Will do some profiles to dig into things.

Roasbeef commented 1 year ago

Heap profile after a restart, with around 3 GB of cache used (note how newUtxoCache seems to report double that):

[screenshot: heap profile]

Aside from the cache, we can see memory building up for cloning the cache entries and also for just decoding transactions themselves. Re tx decoding, we have some PRs up that attempt to reduce garbage during that process and also cache the serialization to make hashing faster (but this is still in checkpointed land); of course, one task at a time.

Logs of cache size+speed:

2023-03-08 11:35:47.871 [INF] SYNC: Processed 335 blocks in the last 10.05s (264489 transactions, height 339095, 2015-01-15 11:34:26 -0800 PST, ~2751 MiB cache)
2023-03-08 11:35:57.873 [INF] SYNC: Processed 436 blocks in the last 10s (284726 transactions, height 339531, 2015-01-18 11:31:10 -0800 PST, ~2816 MiB cache)
2023-03-08 11:36:07.877 [INF] SYNC: Processed 428 blocks in the last 10s (304451 transactions, height 339959, 2015-01-21 13:41:44 -0800 PST, ~2868 MiB cache)
2023-03-08 11:36:17.894 [INF] SYNC: Processed 406 blocks in the last 10.01s (279593 transactions, height 340365, 2015-01-24 16:50:34 -0800 PST, ~2912 MiB cache)
Roasbeef commented 1 year ago

Another profile taken at ~7GB cache size:

[screenshot: heap profile at ~7 GB cache]

Roasbeef commented 1 year ago

One other thing I notice due to the cache design (purge everything once full, instead of just evicting on the fly) is that post-purge there's a significant bump in I/O to refill the cache: we guarantee that the very next block triggers thousands of cache misses. The machine I'm running on has a pretty fast SSD (Samsung 980 Pro M.2 Gen 4), but validation still sees a tangible drop-off due to the increased I/O.

An alternative strategy would be something like:

- Keep the normal height/age based eviction in place
- Once persistently at capacity, start to evict very old UTXOs in the background to make space for new items
- This can be done in a batched manner

Roasbeef commented 1 year ago

CPU profile shows most of the time is spent on GC (this is the default GOGC setting)

[screenshot: CPU profile]

Roasbeef commented 1 year ago

I think the other issue with this design is the flushing. Rather than incrementally evicting items from the cache to write back to disk, we end up flushing the entire cache back to disk all at once. For very large caches, I've seen this take up to 20 minutes to accomplish (still flushing as I type this), during which the entire block download is stalled.

Post-flush, it looks like the runtime spends a lot of time trying to free the GBs allocated back to the OS:

[screenshot: heap profile after flush]

kcalvinalvin commented 1 year ago

Yeah, I've noticed the slowdowns during flushes as well. Memory allocation during flushes was a big slowdown for me too.

Here's a snippet of my log during IBD (6GB of cache, GOGC=100, memlimit of 30GB):

2023-03-02 12:55:08.409 [INF] SYNC: Processed 46 blocks in the last 10.05s (91607 transactions, height 674986, 2021-03-17 14:55:10 +0900 KST, ~5975 MiB cache)                                                                                                                                               
2023-03-02 12:55:18.452 [INF] SYNC: Processed 59 blocks in the last 10.04s (123815 transactions, height 675045, 2021-03-18 00:35:34 +0900 KST, ~5997 MiB cache)                                                                                                                                              
2023-03-02 12:55:19.168 [INF] CHAN: Flushing UTXO cache of ~6001 MiB to disk. For large sizes, this can take up to several minutes...                                                                                                                                                                        
2023-03-02 13:08:44.185 [INF] SYNC: Processed 6 blocks in the last 13m25.73s (14146 transactions, height 675051, 2021-03-18 02:05:39 +0900 KST, ~0 MiB cache)                                      

Heap profile after a restart, with around 3 GB of cache used (note how newUtxoCache seems to report double that):

Was this taken with the flag --profile? I'll check up on the size calculations with the utxo cache.

- Keep the normal height/age based eviction in place
- Once persistently at capacity, start to evict very old UTXOs in the background to make space for new items
- This can be done in a batched manner

So for the Utreexo project, we did some testing on efficient methods of caching utxo entries, and @adiabat + @cbao1198 concluded that evicting old entries once the threshold you set is reached works well. There's a relevant github discussion here: https://github.com/mit-dci/utreexo/discussions/325, and iirc there were graphs generated to visualize this. I'll look into the eviction method first since there's some evidence of it being good.

abitrolly commented 1 year ago

Can Go release memory immediately after a variable is released?

It looks like there are two caches here: (1) a cache for TX lookups, and (2) a buffer for writing to disk.

The lookup cache can be a FIFO queue, where items are moved if they are being hit. The oldest part of the FIFO is written to disk periodically. There could be a good static memory structure for such data that doesn't require GC. Is that what eviction means?
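Roughly, the structure being described might look like the sketch below; this is purely illustrative and not something in the PR:

```go
package cachesketch

import "container/list"

// fifoCache keeps entries in arrival order: the oldest entries sit at the
// front and are what a periodic flush would batch-write to disk, while a
// lookup hit moves the entry to the back so hot items stay in memory.
type fifoCache struct {
	order *list.List               // front = oldest, back = newest
	elems map[string]*list.Element // key -> its position in order
	vals  map[string][]byte
}

func newFifoCache() *fifoCache {
	return &fifoCache{
		order: list.New(),
		elems: make(map[string]*list.Element),
		vals:  make(map[string][]byte),
	}
}

func (c *fifoCache) put(key string, val []byte) {
	if el, ok := c.elems[key]; ok {
		c.order.MoveToBack(el)
	} else {
		c.elems[key] = c.order.PushBack(key)
	}
	c.vals[key] = val
}

func (c *fifoCache) get(key string) ([]byte, bool) {
	el, ok := c.elems[key]
	if !ok {
		return nil, false
	}
	c.order.MoveToBack(el) // a hit keeps the entry "young"
	return c.vals[key], true
}

// evictOldest removes up to n of the oldest entries and returns their keys;
// the caller would batch-write their values to disk before calling this.
func (c *fifoCache) evictOldest(n int) []string {
	var evicted []string
	for i := 0; i < n && c.order.Len() > 0; i++ {
		front := c.order.Front()
		key := front.Value.(string)
		evicted = append(evicted, key)
		c.order.Remove(front)
		delete(c.elems, key)
		delete(c.vals, key)
	}
	return evicted
}
```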

kcalvinalvin commented 1 year ago

Verified that the memory calculation is wrong. Working on a fix and will push it soon.

kcalvinalvin commented 1 year ago

Can Go release memory immediately after a variable is released?

With arenas I believe you could. But it wouldn't be for a single variable.

The lookup cache can be a FIFO queue, where items are moved if they are being hit. The oldest part of the FIFO is written to disk periodically. There could be a good static memory structure for such data that doesn't require GC. Is that what eviction means?

Yeah basically.

Roasbeef commented 1 year ago

Was this taken with the flag --profile? I'll check up on the size calculations with the utxo cache.

Yeah, then using this command: (heap)

go tool pprof --http=localhost:8083 http://localhost:5050/debug/pprof/heap

(cpu)

go tool pprof --http=localhost:8083 http://localhost:5050/debug/pprof/profile

So I was able to sync all the way up to the current day!

I don't have a good sync time, as I had to restart a few times before settling on a 5 GB cache (with the default GOGC setting).

I'll do another run now, with the latest version (after the fix) and this tuned cache size.

Roasbeef commented 1 year ago

The lookup cache can be a FIFO queue, where items are moved if they are being hit

A hit is basically a deletion: if you're looking up a UTXO, it's because it's been spent in the first place. The cache, as is, lets us avoid unnecessary random writes: UTXOs that die young are never written to disk. However, "young" here is relative to the cache size. Once the cache overflows, we write it all to disk, then start from scratch, which guarantees a cold cache (I/O spikes after the purge, as all UTXOs need to be loaded in).
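For illustration, the "die young, never touch disk" behavior can be sketched with a fresh flag in the spirit of bitcoind's cache entries; the entry layout and spend helper below are hypothetical, not btcd's actual types:

```go
// cachedEntry is a hypothetical in-memory UTXO entry. fresh means the entry
// was created after the last flush, so the database has never seen it.
type cachedEntry struct {
	value []byte
	fresh bool
	spent bool
}

// spend marks an outpoint as spent in the cache. A fresh entry can simply be
// dropped, since no disk write ever happened for it and none is needed to
// delete it; only non-fresh spends turn into database deletes at flush time.
func spend(cache map[string]*cachedEntry, outpoint string) {
	entry, ok := cache[outpoint]
	if !ok {
		return // the caller would fetch it from the database first
	}
	if entry.fresh {
		delete(cache, outpoint) // died young: no I/O for this UTXO, ever
		return
	}
	entry.spent = true // the next flush removes it from the database
}
```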

Re GC optimization:

One approach (basically a middle ground to the arena, but potentially also complementary to it) would be to pre-allocate all the UTXO entries in a single swoop, in a single slice. This would allocate all the entries in contiguous (virtual) memory. We'd keep an index of the last unallocated (not set to anything) entry in this slice. We'd then use this to create new UTXO entries, as adding a new entry would just be adding a pointer to the map instead of actually allocating. Assuming we keep the current purge behavior, the entire slice (in theory) can be de-allocated all at once, as all the refs in the map have been deleted and the main slice pointer is no longer held onto.

Going further, we could then also actually never de-allocate the slice. We'd just reset that pointer and know that anything "after" that is actually blank. This is more or less the "memory ballast" approach outlined here: https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i-learnt-to-stop-worrying-and-love-the-heap/.

IIUC, this has basically been subsumed by the arena allocator, as it gives you a nicer API (that uses generics).
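A sketch of that pre-allocate-then-reset idea; the utxoEntry stand-in, entryArena, and its methods are all hypothetical:

```go
// utxoEntry is a stand-in for the real cache entry type.
type utxoEntry struct {
	amount   int64
	pkScript []byte
}

// entryArena hands out UTXO entries from one contiguous, up-front allocation,
// so creating an entry never allocates and the whole batch can be released
// (or reused) in a single step at purge time.
type entryArena struct {
	entries []utxoEntry // one big allocation, contiguous in (virtual) memory
	next    int         // index of the first unused slot
}

func newEntryArena(capacity int) *entryArena {
	return &entryArena{entries: make([]utxoEntry, capacity)}
}

// alloc returns a pointer into the slice; the cache map stores this pointer.
func (a *entryArena) alloc() (*utxoEntry, bool) {
	if a.next == len(a.entries) {
		return nil, false // arena exhausted: time to purge/flush the cache
	}
	e := &a.entries[a.next]
	a.next++
	return e, true
}

// reset is the purge: once the map holding the pointers is dropped, resetting
// the index makes every slot reusable without freeing and re-allocating, much
// like the "memory ballast" trick of keeping the backing memory alive.
func (a *entryArena) reset() {
	a.next = 0
}
```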

Re this:

Once persistently at capacity, start to evict very old UTXOs in the background to make space for new items

I think the reason the cache is designed the way it is right now (copied from bitcoind) was to enable recovery from unclean shutdowns (to avoid a cache flush on each block). I need to revisit that design to properly analyze it, but maybe it turns out this isn't so critical after all. One other side effect of this is:

Also, this flag (env GODEBUG=gctrace=1) can be very helpful in optimization, as it shows what happens with each GC pause and how long the runtime spends on it.

kcalvinalvin commented 1 year ago

Found a bug in utxoCache.InitConsistentState(): if you signal to quit while the utxo cache is rolling back, it'll reject the next block after the tip and cause all the blocks afterwards to be recognized as orphans.

Working towards a fix.

Roasbeef commented 1 year ago

cc @halseth

kcalvinalvin commented 1 year ago

The last 3 commits alleviate the InitConsistentState bug I noticed. The PR is ready for review.

halseth commented 1 year ago

Posted some initial comments while familiarizing myself with the changeset 😄

kcalvinalvin commented 1 year ago

Addressed all comments by @halseth except for https://github.com/btcsuite/btcd/pull/1955#discussion_r1162673185 in the latest commits

kcalvinalvin commented 1 year ago

Converted to draft because I see that the PR itself isn't trivial to review. I saw some things that could be refactored away (e.g. the utxoView interface isn't needed), and the git commits aren't doing the reviewers any favors.

I'll push something more readable in the next couple of days.

halseth commented 1 year ago

Converted to draft because I see that the PR itself isn't trivial to review. I saw some things that could be refactored away (e.g. the utxoView interface isn't needed), and the git commits aren't doing the reviewers any favors.

I'll push something more readable in the next couple of days.

Tag me here when done and I'll review it again 😀

kcalvinalvin commented 1 year ago

Ended up doing a lot of refactoring, with git commits that are clearer and easier to review. There are performance gains over the previous implementation as well.

EDIT: Oh no sorry there's more stuff I can purge from the utxocache code...

kcalvinalvin commented 1 year ago

@halseth @Roasbeef

Ready for reviews now!

saubyk commented 11 months ago

Hi @halseth @Roasbeef this pr is ready for the next round of reviews.

kcalvinalvin commented 11 months ago

Did you try running this compiled with the `-race` flag to ensure concurrent access is safe?

I'm IBD-ing right now with a binary that was compiled with the -race flag. It's ok so far. I'll let it catch up and then keep it running once it's caught up as well.

kcalvinalvin commented 10 months ago

Additional things pushed:

  1. Rebased on top of master.
  2. Added 3 new commits to allow pruning to work with the utxo cache code.
  3. Added an extra test in TestUtxoCacheFlush to check that the utxo cache flushes for FlushMode.FlushPeriodic.

kcalvinalvin commented 9 months ago

EDIT: Untested. Could you post some numbers of IBD using the latest changeset?

Sure, I'll post some numbers+flamegraphs.

Addressed the requested changes and re-checked that the older requested changes are addressed.

kcalvinalvin commented 8 months ago

While generating the numbers + flamegraphs, I found two bugs which are fixed and now tested by the latest push.

1: There would be a concurrent map access error with the mapslice after the node has caught up. This has been addressed with a lock on the utxocache struct. Added a test for this bug.

2: There was a bug where the flush wouldn't happen during a prune, as they were both accessing the database. The code has now been changed so that you have to pass in a database.Tx (see the sketch after this list). Added a test for this bug.

3: Added a test for InitConsistentState just for good measure as I saw it wasn't in the test coverage.
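For (2), a hedged sketch of the pattern: the flush takes the already-open database transaction instead of opening its own, so a prune and a flush never contend for the database. The flush signature and the wrapper below are illustrative, not the PR's exact code:

```go
// connectBlockSketch shows the pattern only; it is not the PR's actual code.
func (b *BlockChain) connectBlockSketch(block *btcutil.Block) error {
	return b.db.Update(func(dbTx database.Tx) error {
		// ... write the block and spend journal, and do any pruning work,
		// all through dbTx ...

		// The cache flush reuses the same open transaction rather than
		// starting a second one against the database.
		return b.utxoCache.flush(dbTx, b.BestSnapshot())
	})
}
```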

kcalvinalvin commented 8 months ago

Synced up to block height 801867 with the following command: ./btcd --datadir=. --logdir=. --whitelist=127.0.0.1 --listen=:9333 --rpclisten=:9334 --rpcuser=calvin --rpcpass=calvin --addpeer=127.0.0.1 --prune=400000 --utxocachemaxsize=16000 --cpuprofile=cpuprof-1 --memprofile=memprof-1

Time taken: 20 hours, 9 minutes, 33 seconds.
Time taken up until the last checkpoint at height 751,565: 11 hours, 10 minutes, 31 seconds.

My computer: CPU: Ryzen 3600, Memory: 48GB, Disk: 1TB SanDisk SSD.

The local peer is a Bitcoin Core node, and I think sometimes the btcd node is slowed down because the core node just isn't serving blocks quickly enough. Not sure if this is rate limiting by the core node.

CPU profile


Most of the time taken was with GC and signature checking code.


Memory profile


The 2 most costly allocations were for downloaded blocks and tx copying during sighash calculation.


Raw profiles are provided here: profiles.tar.gz

halseth commented 8 months ago

Nice! How does it compare to the non-utxo-cache case on the same setup?

Would connection to multiple (real) peers make a difference?

kcalvinalvin commented 8 months ago

Nice! How does it compare to the non-utxo-cache case on the same setup?

Oh, I took the numbers a few months back but I can't find them now. It was over 24 hours. I can get the numbers again if needed.

Would connection to multiple (real) peers make a difference?

In my testing, no, since I pointed it at my local node. In a real-world setting it would matter, but since there isn't code for parallel block downloads at the moment, there wouldn't be a difference with this code.

Roasbeef commented 7 months ago

Ok I ran a sync and got to chain tip in ~16 hours:

2023-11-15 00:39:16.139 [INF] BTCD: Version 0.23.3-beta
2023-11-15 16:34:35.923 [INF] SYNC: Processed 18 blocks in the last 12.13s (70918 transactions, height 816940, 2023-11-15 15:07:42 -0800 PST, ~6272 MiB cache)

I ran with a 16 GB cache.

Specs are:

That's pretty damn good! Similar to the above, the thing that takes the most time when it gets to modern-day blocks is the serialization of the transaction over and over again. I have this PR that resolves that; it has a bug somewhere, though, where something is mutated but the value isn't updated (it only fails in a unit test).

Similar to @kcalvinalvin above, it's syncing directly against a local bitcoind. My internet isn't super fast, and we already know a single slow peer can stall the entire download as we don't have parallel block download implemented, so I think this gives us a more accurate picture.

I'm running another sync test against master, so we can compare.

EDIT: At the 18 hour mark, master is still in 2020.

EDIT 2: Master finished in 43 hours.

kcalvinalvin commented 7 months ago

That's pretty damn good! Similar to the above, the thing that takes the most time when it gets to modern-day blocks is the serialization of the transaction over and over again. I have this PR that resolves that; it has a bug somewhere, though, where something is mutated but the value isn't updated (it only fails in a unit test).

I think #2023 is a better optimization since we can just use the serialized tx directly from the block. We already have the serialized tx as a part of the block so just reusing that makes a lot of sense to me.

Roasbeef commented 7 months ago

I think https://github.com/btcsuite/btcd/pull/2023 is a better optimization since we can just use the serialized tx directly from the block.

Makes sense; my version also only saved the non-segwit encoding, but over time we'll need the segwit encoding as well. Can you remove it from draft so we can run CI on it?

Roasbeef commented 7 months ago

Also, one other diff with that PR is that things are cached only after a call to block.Bytes(), while the earlier one caches basically as soon as things are read off the wire.

kcalvinalvin commented 7 months ago

Also, one other diff with that PR is that things are cached only after a call to block.Bytes(), while the earlier one caches basically as soon as things are read off the wire.

Out of draft now!

kcalvinalvin commented 6 months ago

Rebased and addressed the comments from roasbeef

kcalvinalvin commented 6 months ago

All fixup commits squashed now in the latest push.