florian1345 / lru-mem

An LRU cache implementation bounded by memory.

<Vec<u8> as HeapSize>::heap_size() is slow #40

Closed: exFalso closed this issue 1 year ago

exFalso commented 1 year ago

I'm not quite sure how to tackle this, but a Vec<u8> cache entry is surprisingly expensive by default because of the heap_size() call literally iterating over every byte inside.

I worked around it by wrapping the vector and implementing HeapSize with just vec.capacity(), but getting there required some perf benchmarking. Ideally the user wouldn't have to think about this (I might add that Vec<u8> is probably one of the most common cache entry types).
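For reference, the workaround is roughly the following, a minimal sketch: the newtype name is made up, and the local HeapSize trait is only a stand-in for the crate's trait (see the definition sketched later in this thread).

trait HeapSize {
    // Stand-in for the crate's HeapSize trait.
    fn heap_size(&self) -> usize;
}

// Hypothetical newtype used as the cache entry instead of a bare Vec<u8>.
struct Bytes(Vec<u8>);

impl HeapSize for Bytes {
    fn heap_size(&self) -> usize {
        // u8 owns no heap memory of its own, so the allocation's capacity is the
        // whole heap footprint; no per-byte iteration is required.
        self.0.capacity()
    }
}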

I wonder whether there is some sort of type-level magic with associated types that could force the right behaviour while still allowing the generic Vec<T> implementation?

florian1345 commented 1 year ago

Thanks for reporting the issue! You are right that Vec<u8> should not have a linear-time HeapSize implementation. I trusted compiler optimizations to achieve good performance in order not to over-complicate the code.

The problem seems to occur with opt-level < 3, since good performance requires inlining the HeapSize and MemSize implementations as well as figuring out that items.into_iter().map(|_| 0).sum() is 0. With opt-level = 3, the compiler seems to be able to do that.

Regarding your use case, is the performance insufficient even for debug builds? If so, there may indeed be some elegant solutions, but without specialization they may require some unintuitive design. My current idea would be default-implemented methods on HeapSize that take impl Iterator<Item = &Self> and impl ExactSizeIterator<Item = &Self> as input, intended for types with zero and constant heap size respectively, which would be non-breaking. If you were hinting at something particular with "associated types", I would be interested to hear any alternatives.

exFalso commented 1 year ago

Interesting. We are using --release, which has a default opt-level of 3, but heap_size() was just as slow. We did have the Vec wrapped in other structures; perhaps that's why the inlining failed.

How would the default methods be non-breaking? Do you mean the default methods are added to aid the user in implementing a faster HeapSize?

With type-level magic I was thinking of some kind of type-based selector for "explicit" specialization, but now, toying around with it, I think that would only work if the user marked the structures explicitly. E.g.: https://gist.github.com/rust-play/1352e3891e15756710d68c834da0a2b9

Ultimately, without a wrapper the element type-based dispatch requires either specialization or runtime dispatch...

florian1345 commented 1 year ago

Hmm, it is possible I messed up the benchmark and it was optimized more than is normally possible. I will have to investigate, but either way I think it would be better to have acceptably fast debug builds (heap_size for a 256 MiB vector took ~3 s on my machine, which I deem annoying even for debug mode).

The extra methods would be added to the HeapSize trait itself and call the heap_size method for their default implementation. Types with constant heap size could then override the default implementation with a more efficient approach, e.g.

trait HeapSize {
    /// Returns the number of bytes this value owns on the heap.
    fn heap_size(&self) -> usize;

    /// Sums the heap size of every item yielded by the iterator. The default
    /// implementation is linear in the number of items.
    fn heap_size_iter<'a>(iter: impl Iterator<Item = &'a Self>) -> usize
    where
        Self: 'a,
    {
        iter.map(|item| item.heap_size()).sum()
    }
}

impl HeapSize for u8 {
    fn heap_size(&self) -> usize {
        0
    }

    /// A u8 owns no heap memory, so the sum is 0 regardless of the item count.
    fn heap_size_iter<'a>(_iter: impl Iterator<Item = &'a Self>) -> usize
    where
        Self: 'a,
    {
        0
    }
}

Then the [u8] implementation could use heap_size_iter to efficiently compute the size of all of its items. Thinking about it, this could in theory be breaking if a downstream crate calls some other trait's heap_size_iter on a type which also implements HeapSize, but I find that unlikely.
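For illustration, such a slice implementation might look roughly like this; this is only a sketch building on the trait above, not necessarily the final design:

impl<T: HeapSize> HeapSize for [T] {
    fn heap_size(&self) -> usize {
        // Dispatches through the element type, so [u8] hits the constant-time
        // override above instead of iterating over every byte.
        T::heap_size_iter(self.iter())
    }
}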

The KnownHeapSize solution also looks nice, and I think it would be a good candidate if it could be combined with specialization. As it stands, having to wrap the vector in a KnownElementHeapSize<_> feels a bit unergonomic. However, I like the simplicity of specifying a constant heap size and having efficient HeapSize implementations for various types derived from that. I have to think about whether some parts of this approach can be used.
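For context, a wrapper-based design along these lines might look something like the following. This is an illustrative sketch only, not the contents of the linked gist, and it builds on the HeapSize trait sketched above:

/// Opt-in marker: every value of the type owns the same, statically known
/// amount of heap memory.
trait KnownHeapSize {
    const HEAP_SIZE: usize;
}

impl KnownHeapSize for u8 {
    const HEAP_SIZE: usize = 0;
}

/// Wrapper the user puts around the collection, e.g. KnownElementHeapSize<Vec<u8>>.
struct KnownElementHeapSize<C>(C);

impl<T: KnownHeapSize> HeapSize for KnownElementHeapSize<Vec<T>> {
    fn heap_size(&self) -> usize {
        // Constant time: size of the backing buffer plus the constant
        // per-element heap footprint; no iteration over the elements.
        self.0.capacity() * std::mem::size_of::<T>() + self.0.len() * T::HEAP_SIZE
    }
}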

florian1345 commented 1 year ago

The issue with my benchmarks was that I did not cleanly separate opt-level = 3 from lto = true. In fact, the latter is responsible for the necessary optimizations. If you prefer that over your current workaround, you can put lto = true (with at least opt-level = 2) into your build profile while I work on a more reliable fix. It will probably increase build times, however.
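For reference, that profile tweak goes into Cargo.toml along these lines:

[profile.release]
# Enables (fat) link-time optimization, which allows the heap_size loop to be
# optimized away across crate boundaries.
lto = true
# The release profile already defaults to 3; anything >= 2 should suffice.
opt-level = 3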

exFalso commented 1 year ago

Oh interesting, I didn't know LTO is switched off by default; I'll enable it for our release builds. Currently we have an explicit implementation that uses vec.capacity(), which works well too.

I think the heap_size_iter solution would work well, and the corner cases regarding backwards compatibility are probably very rare. Perhaps it still warrants a version bump, though.

florian1345 commented 1 year ago

I just released v0.3.0, which should resolve this issue. If you are still experiencing low performance, please reopen the issue and share details.