[optimization] Use bytes slice pool for allocations

Wondertan commented 3 years ago

Context

We allocate plain byte slices extensively throughout the project repos and after some work with them discard allocated slices causing GC to clean them up. In some hot paths, like whole block processing flow, this surely causes additional pressure on both allocator and GC. Fortunately, there is a relatively simple trick to avoid that and reduce stress in general for all data allocations by reusing fixed-sized allocated buffers through sync.Pool as basic primitive, though we can rely on already existing libs with simple APIs not to reinvent the wheel.

Expected Gains

Stop-the-world takes much less CPU time.
Much less RAM usage for the long run execution, but in some cases can be observed in some e2e testing as well.
Less pressure on allocators, as mentioned already.

I would like to provide real numbers, but this is very application-specific, so for us, they could be very different. For my previous project, htop showed almost 2x RAM usage reduction after applying this to multiple places.

Options:

https://github.com/libp2p/go-buffer-pool

Has fewer stars and less used within Go comm, but has a smarter API. Actually, we already using it as we rely on IPFS/libp2p, thus using it directly would allow us to share allocated bytes within IPFS and LL code, what is generally good. I would just stick with this one.
https://github.com/valyala/bytebufferpool

From the gopher who created fasthttp. BTW, he is from Kyiv and I know him personally. He is obsessed with optimizations.

Places to be altered

[ ] Rsmt2d
[ ] NMT
[ ] Un/marshalling of core data types
[ ] Reactor's msg un/marshalling

pls add more

Wondertan commented 3 years ago

Theissue is related to all sublibs we have in LL's org, but AFAIK we don't have an established process on how to create such root issues, so I decided to keep it in the core.

Wondertan commented 3 years ago

@liamsi, pls attach it to the proper Project. Don't know what to choose

liamsi commented 3 years ago

As all this sounds good but it's mostly about optimizations, I've attached this to Testnet. We could even do some of this after mainnet though.

AFAIK we don't have an established process on how to create such root issues, so I decided to keep it in the core.

For similar cases, that span multiple repositories, I've used cards in the projects (e.g. see the first entry in https://github.com/orgs/lazyledger/projects/4 or in https://github.com/orgs/lazyledger/projects/6).

In some hot paths, like whole block processing flow, this surely causes additional pressure on both allocator and GC.

IMO, the first step should be to actually measure this via a profiler and benchmarks to see if that actually has any impact. Especially compared to overall systems latency. I'd expect the main bottleneck will be IO / network comms (it always is). And the gains through low-level optimizations like the mentioned ones are often rather minuscule in the grand scheme of things. We also need to measure network latencies via our ipld experiments (or via that testframework protocol labs developed). I think measuring latencies for retrieving block data as well as sampling da proofs has much higher priority (e.g. see: https://github.com/lazyledger/ipld-plugin-experiments/issues).

liamsi commented 3 years ago

rsmt2d is a bit different with this regard, as we are currently refactoring it to streamline the APIs and make it production-ready. So if you have good ideas there, you can sync with @adlerjohn who has recently added benchmarks there.

Wondertan commented 3 years ago

For the new rsmt2d API we can add something like Close/Destruct/Free and etc semantics for a user to free(get back to the pool) resources once processing ended.

Wondertan commented 3 years ago

As all this sounds good but it's mostly about optimizations

Totally agree, but It would hurt to file an issue for the future.

IMO, the first step should be to actually measure this via a profiler and benchmarks to see if that actually has any impact.

But measuring differences would require us to implement that first, so it hard to understand the exact numbers of gains without doing actual work here.

I'd expect the main bottleneck will be IO / network comms (it always is)actually to implement that first

This is surely not about bottlenecks but better resource management within the application with predictable gains. Imagine how many times we allocate fixed-sized slices for samples every block when we can reuse them by allocating them once, hiding from GC in sync.Pool after some processing, and then accessing again when a new block arrives. And there are other cases like this. Unfortunately, we cannot fully manage memory in Go ourselves(btw, I would love to see toggleable GC in Go). Still, we are smart enough to overcome some of the disadvantages of GC.

liamsi commented 3 years ago

But measuring differences would require us to implement that first, so it hard to understand the exact numbers of gains without doing actual work here.

Yeah, you are right about measuring the difference of course. I'm just saying that we should first profile how much time and how many allocs happen because of GC (of slices) first.

It's certainly good that you opened an issue about it!

tzdybal commented 3 years ago

"Premature optimization is the root of all evil". Definitely good place to find some CPU cycles and memory, but I totally support @liamsi on this - we probably can do it even after mainnet.

liamsi commented 2 years ago

~~Is this still relevant? I think @Wondertan went and did this in node anyways?~~ Moved over to node.

Bidon15 commented 2 years ago

Grooming 12/07/2022:

This could be a part of a later epic describing the RAM optimization roadmap

celestiaorg / celestia-node