Ralith / oddio

Lightweight game audio
Apache License 2.0
150 stars 9 forks source link

SPSC memory is allocated twice #74

Open mgeier opened 2 years ago

mgeier commented 2 years ago

This is probably not a big deal, I just wanted to mention it because it surprised me when I first realized it ...

https://github.com/Ralith/oddio/blob/adc60db7a8c9ccda5a45e5b3a6c06a1a82f8dc25/src/spsc.rs#L197-L202

After allocating all memory, the Box is converted to an Arc which allocates completely new memory and copies the initialized values to it. AFAICT, the originally allocated memory is deallocated again at the end of the new() method.

Is this intentional? Or am I missing something?

I found this when trying to convert my own SPSC ring buffer to a DST (see https://github.com/mgeier/rtrb/pull/75), where this crate's code was very helpful. I avoided the repeated allocation by manually implementing a small subset of Arc for this special case instead of using Arc directly.

Ralith commented 2 years ago

Your analysis is correct, and this behavior is intended. I agree it's kind of ugly. For Frames, another DST implemented by oddio, the standard Arc type was desired as part of the public interface, and I think here I just reused the same approach, also since it saves some extra unsafe and doesn't cost much in the long run. At least when Arc would not otherwise be part of the public interface, I agree that rolling your own refcounting is a defensible approach as well.

I found this when trying to convert my own SPSC ring buffer to a DST (see mgeier/rtrb#75), where this crate's code was very helpful.

I'm happy to hear that! I'm a big fan of rtrb's interface, and if you merge that I'll be out of excuses not to pull it in directly in favor of my reinvented wheel here.

mgeier commented 2 years ago

Thanks for confirming my speculations!

it saves some extra unsafe

Yeah, that's definitely an advantage. My implementation has quite a lot of unsafe.

doesn't cost much in the long run

Well, that's the question, I'm not sure about that.

In general, yes, I care more about the real-time operation than the initialization phase. However, if really big buffers are used for some reason, the allocation overhead might become relevant?

Also, I have an idea to push this even further (but I don't know whether it's a good idea!): One could provide an (unsafe) API to construct a RingBuffer by providing an existing piece of memory to be used. In this case, the dependency to alloc could be dropped (or rather made optional). I have never actually needed this, though. And there haven't been any requests for something like this. So for now this is just a theoretical deliberation.

rolling your own refcounting is a defensible approach as well.

In fact, I can get away with a much simpler approach than actual refcounting. For one, I don't need to implement the equivalent of clone(), because producer and consumer are created at the same time anyway. And I don't really need to count that much, I only need to represent 2 states, so an AtomicBool is sufficient.

if you merge that I'll be out of excuses not to pull it in directly in favor of my reinvented wheel here.

Yeah, I'm still unsure, I'd like to gather more input before moving forward. I want to spend a bit more time on benchmarking, because for now the multi-threaded benchmarks show no change in performance. But maybe the benchmarks themselves could be improved.

However, I've already found a few places where the Rust compiler doesn't seem to optimize away all abstractions, so I had to "lower" a few of them to get competitive performance, see https://github.com/mgeier/rtrb/pull/75/commits/d0764ba29d1954a9cb058ac094c1c80583482ad9. Maybe there are still some optimizations missing ...

Have you experienced clear performance improvements when switching to a DST?

Apart from performance, I think it's more elegant to have UnsafeCell<[MaybeUninit<T>]> instead of *mut T, but it comes at the cost of more code, a lot of it being unsafe.

I'd be happy about any kind of input!