JohnDowson / fusebox

MIT License

Benchmarks #1

Open cpud36 opened 1 year ago

cpud36 commented 1 year ago

It is unclear when this crate performs better than the Vec<Box<dyn Trait>> approach. Having benchmarks would help outline possible use cases.

From what I can see, it makes sense to compare the following approaches:

  1. Naive Vec<Box<dyn Trait>>
  2. FuseBox itself
  3. Using Vec<&dyn Trait> with bumpalo as the allocator (or any other slab/arena allocator).

My hunch is that:

  1. All approaches have equivalent perf on random access
  2. Approach 2 is faster than approach 1 for sequential iteration, but is slightly slower than approach 3
  3. Approaches 1 and 3 are the fastest for sequential iteration for ZST
JohnDowson commented 1 year ago

Do you have any ideas for benchmark-able workloads?

JohnDowson commented 1 year ago

So far it seems like Vec is a bit faster in linear iteration, but FuseBox is a bit faster in random access. Which is kinda weird.

cpud36 commented 1 year ago

It seems your malloc is pretty good.

Essentially, Header::offset is not that different from a pointer. To get to the data, one has to first index into the headers vector and then index into the inner pointer/array, and each array index operation consists of a pointer offset plus a pointer dereference. So FuseBox is still prone to the double-indirection problem.

In the linear iteration bench, both FuseBox and Boxes seem to have allocated the data contiguously. FuseBox uses a custom arena allocator, so this is to be expected. For Boxes, malloc just did a pretty good job in this case (for allocations of similar size, some malloc implementations essentially use a global arena).

The performance penalty FuseBox took is simply due to the Header being slightly larger than a fat pointer. A Box<dyn Trait> stores a pointer to the data and a pointer to the vtable, 16 bytes total. The Header is 24 bytes (an extra size field). So FuseBox suffers more cache misses simply because the headers vec is larger.
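The size difference can be checked directly (assuming a 64-bit target; the Header layout below is a stand-in matching the description above, not FuseBox's actual definition):

```rust
use std::mem::size_of;

trait Trait {}

// Stand-in for the Header described above; field names are assumptions.
struct Header {
    offset: usize,     // 8 bytes: where the value lives in the arena
    vtable: *const (), // 8 bytes: stand-in for the vtable pointer
    size: usize,       // 8 bytes: the extra field under discussion
}

fn main() {
    // A Box<dyn Trait> is a fat pointer: data pointer + vtable pointer.
    assert_eq!(size_of::<Box<dyn Trait>>(), 16);
    // The Header carries one extra usize, so the headers Vec is 50% larger.
    assert_eq!(size_of::<Header>(), 24);
}
```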

For starters, you could remove the Header::size field and instead store only the size of the last element. This should improve linear iteration performance.

P.S. I am puzzled by the random access performance.

NobodyXu commented 1 year ago

> It seems your malloc is pretty good.

You could try the musl target; it provides absolutely the worst allocator performance out-of-the-box, and it is also the default on Alpine.

Also, maybe it's a good idea to mix different trait objects?

JohnDowson commented 1 year ago

> For starters, you could remove the Header::size field and instead store only the size of the last element. This should improve linear iteration performance.

The benchmarks I did were done after I removed size from Header.

Here's a plot of random access performance as it is now (inlinemeta fusebox is a version of fusebox with { offsets: Vec<usize>, inner: *mut (T, Metadata) }):

[Figure 1: random access benchmark plot]

JohnDowson commented 1 year ago

For completeness' sake here's linear access performance:

[Figure 2: linear access benchmark plot]

JohnDowson commented 1 year ago

And musl just makes everyone perform slightly worse.

Musl random access: [benchmark plot]

Musl linear iteration: [benchmark plot]

JohnDowson commented 1 year ago

Tried Vec<&'bumpalo mut dyn Trait>.
Confusingly, it is even worse than the naive Vec<Box<dyn Trait>> in random access, but ever so slightly faster in linear access.

Random access: [benchmark plot]
Linear access: [benchmark plot]