Thanks for reporting the issue! You are right that `Vec<u8>` should not have a linear-time `HeapSize` implementation. I trusted compiler optimizations to achieve good performance so as not to over-complicate the code.

The problem seems to occur with `opt-level < 3`, as good performance requires inlining the `HeapSize` and `MemSize` implementations as well as figuring out that `items.into_iter().map(|_| 0).sum()` is 0. With `opt-level = 3`, the compiler seems to be able to do that.
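For context, the generic implementation presumably has roughly the following shape (a sketch under that assumption, not the crate's actual code). For `T = u8` the per-item sum degenerates into the `map(|_| 0).sum()` pattern mentioned above, which only disappears under sufficiently aggressive optimization:

```rust
use std::mem::size_of;

trait HeapSize {
    /// Number of heap bytes owned by `self`, excluding `size_of::<Self>()`.
    fn heap_size(&self) -> usize;
}

impl HeapSize for u8 {
    fn heap_size(&self) -> usize {
        0 // a u8 owns no heap memory
    }
}

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        // The backing allocation, plus whatever each element itself owns.
        // Without inlining, this loop runs even when every term is 0.
        self.capacity() * size_of::<T>()
            + self.iter().map(|item| item.heap_size()).sum::<usize>()
    }
}
```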
Regarding your use case, is the performance insufficient even for debug builds? If so, there may indeed be some elegant solutions; however, without specialization they may require some unintuitive design. My current idea would be default-implemented methods on `HeapSize` that take as input an `impl Iterator<Item = &Self>` and an `impl ExactSizeIterator<Item = &Self>` for zero- and constant-sized types respectively, which would be non-breaking. If you were hinting at something particular with "associated types", I would be interested to hear any alternatives.
Interesting, we are using `--release`, which has a default opt-level of 3, but `heap_size()` was just as slow. We did have the `Vec` wrapped in other structures; perhaps that's why the inlining failed.
How would the default methods be non-breaking? Do you mean the default methods are added to aid the user in implementing a faster `HeapSize`?
With type-level magic I was thinking of some kind of type-based selector for "explicit" specialization, but now, toying around with it, I think that would only work if the user marked the structures explicitly. E.g.: https://gist.github.com/rust-play/1352e3891e15756710d68c834da0a2b9
Ultimately, without a wrapper, element-type-based dispatch requires either specialization or runtime dispatch...
Hmm, it is possible I messed up the benchmark and it optimized more than is normally possible. I will have to investigate, but either way I think it would be better to have an acceptably fast debug mode (`heap_size` for 256 MiB vectors took ~3 s on my machine, which I deem annoying even for debug mode).
The extra methods would be added to the `HeapSize` trait itself and call the `heap_size` method for their default implementation. Types with constant heap size could then override the default implementation with a more efficient approach, e.g.
```rust
trait HeapSize {
    fn heap_size(&self) -> usize;

    // Default implementation: fall back to summing per-item heap sizes.
    fn heap_size_iter(iter: impl Iterator<Item = &Self>) -> usize {
        iter.map(|item| item.heap_size()).sum()
    }
}

impl HeapSize for u8 {
    fn heap_size(&self) -> usize {
        0
    }

    // Override: a u8 never owns heap memory, so skip the iteration entirely.
    fn heap_size_iter(_iter: impl Iterator<Item = &Self>) -> usize {
        0
    }
}
```
Then the `[u8]` implementation could use `heap_size_iter` to efficiently compute the size of all of its items. Thinking about it, it could in theory be breaking if a downstream crate calls some `heap_size_iter` from another trait on a type which also implements `HeapSize` (the added method would make that call ambiguous), which I find unlikely.
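For illustration, the container implementations could then route through the new method; a sketch on top of the trait above (the real code may differ):

```rust
use std::mem::size_of;

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        // Backing allocation plus the items' own heap memory. For T = u8,
        // the overridden heap_size_iter returns 0 without touching any item.
        self.capacity() * size_of::<T>() + T::heap_size_iter(self.iter())
    }
}
```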
The `KnownHeapSize` solution also looks nice, and I think it would be a good candidate if it could be used with specialization. As it stands, having to wrap the vector in a `KnownElementHeapSize<_>` feels a bit unergonomic. However, I like the simplicity of specifying a constant heap size and having efficient implementations of various `HeapSize` types be derived from that. I have to think about whether some parts of this approach can be used.
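Since the gist isn't reproduced here, the wrapper approach presumably looks something like this (names taken from the comments above, details guessed; it builds on the `HeapSize` trait sketched earlier):

```rust
use std::mem::size_of;

// Types whose per-element heap size is a compile-time constant.
trait KnownHeapSize {
    const HEAP_SIZE: usize;
}

impl KnownHeapSize for u8 {
    const HEAP_SIZE: usize = 0; // u8 owns no heap memory
}

// The explicit wrapper the user must apply to opt into the O(1) path.
struct KnownElementHeapSize<T>(Vec<T>);

impl<T: KnownHeapSize> HeapSize for KnownElementHeapSize<T> {
    fn heap_size(&self) -> usize {
        // O(1): capacity bytes plus a constant contribution per element.
        self.0.capacity() * size_of::<T>() + self.0.len() * T::HEAP_SIZE
    }
}
```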
The issue with my benchmarks was that I did not cleanly separate `opt-level = 3` from `lto = true`. In fact, the latter is responsible for the necessary optimizations. If you prefer that over your current workaround, you can put `lto = true` (with at least `opt-level = 2`) into your build profile while I work on a more reliable fix. It will probably increase build time, however.
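For reference, that profile tweak would go into `Cargo.toml` along these lines (a sketch; merge it with any existing profile settings):

```toml
[profile.release]
opt-level = 3  # the release default; at least 2 is needed here
lto = true     # off by default; enables the cross-crate inlining required
```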
Oh interesting, I didn't know LTO is switched off by default; I'll enable this for our release builds. Currently we have an explicit implementation that uses `vec.capacity()`, which works well too.
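That workaround presumably looks something like the following sketch (the newtype name is hypothetical, and it uses the `HeapSize` trait from above):

```rust
// Hypothetical newtype; `Bytes` is not a name from this thread.
struct Bytes(Vec<u8>);

impl HeapSize for Bytes {
    fn heap_size(&self) -> usize {
        // A Vec<u8>'s heap footprint is exactly its allocated capacity,
        // since u8 elements own no further heap memory.
        self.0.capacity()
    }
}
```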
I think the `heap_size_iter` solution would work well, and the corner cases wrt backwards compatibility are probably super rare. Perhaps it still warrants a version bump.
I just released v0.3.0 which should resolve this issue. If you are still experiencing low performance, please re-open the issue and share details.
I'm not quite sure how to tackle this, but a `Vec<u8>` cache entry is surprisingly expensive by default because the `heap_size()` call literally iterates over every byte inside.

I worked around it by wrapping and implementing `HeapSize` with just `vec.capacity()`, but this required some perf benchmarking. Ideally the user wouldn't have to think about this (I might add that `Vec<u8>` is probably one of the most common cache entries).

I wonder whether there is some sort of type-level magic with associated types which can force the right behaviour but also allows for the generic `Vec<T>` implementation?