crossbeam-rs / crossbeam

Tools for concurrent programming in Rust

SegQueue::push() ... pop() keeps allocating. #398

Open ralfbiedert opened 5 years ago

ralfbiedert commented 5 years ago

Background

To use rayon in game engine code, I was investigating why ThreadPool::install was allocating on every invocation. It turned out there are two sources of allocation; the SegQueue behavior below is one of them.

Observation

If I run this simple test app:

use crossbeam::queue::SegQueue;

// `measure` is an allocation-logging helper (described further down).
let x: SegQueue<u32> = SegQueue::new();

for _ in 0..100 {
    measure("push_pop", || {
        x.push(1);
        x.pop();
    });
}

I receive the following allocation events:

"push_pop" -- Events: 1, Bytes: 504
"push_pop" -- Events: 0, Bytes: 0
...
... 32 invocations later ...
... 
"push_pop" -- Events: 0, Bytes: 0
"push_pop" -- Events: 1, Bytes: 504

The allocation is caused by these lines:

// In seg_queue.rs: when writing the last slot of the current block,
// pre-allocate the next block so a subsequent push has somewhere to go.
if offset + 1 == BLOCK_CAP && next_block.is_none() {
    next_block = Some(Box::new(Block::<T>::new()));
}

This, in turn, is caused by the tail index growing monotonically (it advances in steps of two because the index's low bit is used as a flag):

offset 0 ... tail 0
offset 1 ... tail 2
offset 2 ... tail 4
offset 3 ... tail 6
[...]
offset 29 ... tail 58
offset 30 ... tail 60
offset 0 ... tail 64

Issue

I have to admit that I'm not 100% sure whether this is really a bug or just a "feature" of the algorithm, but I would naively expect that a loop of q.push(x); q.pop(); would not cause the SegQueue to reallocate.

Chopinsky commented 5 years ago

So this is almost certainly what the algorithm expects: when a Block runs out of free space, a new block is allocated for the incoming data. That's where the magic number 32 comes into play -- it's the capacity of a Block. Even ignoring the pop in your test, we would still need to allocate every 32 pushes, as SegQueue won't reuse existing blocks.

Alternatively, this issue could be mitigated by putting destroyed blocks in a "junkyard" and pulling them out when a new Block is needed. However, this could raise security concerns (popped data would still be sitting around in memory), and maintaining the junkyard can be a headache as well (extra overhead that may very well dwarf the allocation problem).

cuviper commented 5 years ago

It seems wise to keep some available capacity when possible. A more general case could be made for whenever a pop empties the head block while N full blocks remain (here N = 0) -- making that block available for new tail pushes might be nice. Then a single pop(); push() should never need to allocate.

schets commented 5 years ago

A while ago I experimented with something like the junkyard idea, and even for a much simpler mpsc situation it was significantly more expensive in the average case than allocating.

I remember some work being done here on an mpmc unbounded queue which would grow in a fashion similar to a vector, albeit with massive latency tails whenever a resize happened.

Chopinsky commented 5 years ago

> A while ago I experimented with something like the junkyard idea, and even for a much simpler mpsc situation it was significantly more expensive in the average case than allocating. I remember some work being done here on an mpmc unbounded queue which would grow in a fashion similar to a vector, albeit with massive latency tails whenever a resize happened.

So instead of using an mpmc queue, can we just make the junkyard a fixed-size array (since we won't actually need an unlimited junkyard), and use a lock-free structure (e.g. a naive AtomicBool per slot for in-use vs. free status) to guard the junkyard? This could help reduce allocations when frequent push-pop cycles happen in a short period of time, as described in the OP.

schets commented 5 years ago

That's more or less what I did, IIRC. The problem was that with multiple producers there's contention on the free/in-use tag, and on x86 this is especially expensive.

If one were to limit the junkyard to the machine's word size (one slot per bit), you could simply use an atomic bitmask, which might be simple enough to beat allocation.
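
A minimal sketch of that bitmask idea (all names here are hypothetical; the side array actually holding the reusable blocks is omitted):

use std::sync::atomic::{AtomicUsize, Ordering};

// A word-sized "junkyard" tracked by a single atomic bitmask.
// Bit i set means slot i of a side array holds a reusable block.
struct Junkyard {
    free: AtomicUsize,
}

impl Junkyard {
    const fn new() -> Self {
        Junkyard { free: AtomicUsize::new(0) }
    }

    // Try to claim a slot that holds a reusable block.
    fn claim(&self) -> Option<usize> {
        let mut cur = self.free.load(Ordering::Relaxed);
        loop {
            if cur == 0 {
                return None; // junkyard empty; caller allocates instead
            }
            let i = cur.trailing_zeros() as usize;
            match self.free.compare_exchange_weak(
                cur,
                cur & !(1 << i),
                Ordering::Acquire,
                Ordering::Relaxed,
            ) {
                Ok(_) => return Some(i),
                Err(actual) => cur = actual,
            }
        }
    }

    // Mark slot i as holding a reusable block again.
    fn release(&self, i: usize) {
        self.free.fetch_or(1 << i, Ordering::Release);
    }
}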

frostyplanet commented 4 years ago

@ralfbiedert Would you shed some light on how the measure() of allocations was done? I was looking into a memory problem in my code and would be interested in a tool for that kind of problem.

ralfbiedert commented 4 years ago

That was just an ugly #[global_allocator] where I toggled logging on and off with some static atomic.
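
A minimal sketch of that approach (names are hypothetical; the actual achtung_allocation code differs):

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

// Wrap the system allocator and count events/bytes while a flag is set.
static LOGGING: AtomicBool = AtomicBool::new(false);
static EVENTS: AtomicU64 = AtomicU64::new(0);
static BYTES: AtomicU64 = AtomicU64::new(0);

struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if LOGGING.load(Ordering::Relaxed) {
            EVENTS.fetch_add(1, Ordering::Relaxed);
            BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        }
        System.alloc(layout)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static ALLOC: CountingAlloc = CountingAlloc;

// Run a closure with logging enabled and print the counters.
fn measure(name: &str, f: impl FnOnce()) {
    EVENTS.store(0, Ordering::Relaxed);
    BYTES.store(0, Ordering::Relaxed);
    LOGGING.store(true, Ordering::Relaxed);
    f();
    LOGGING.store(false, Ordering::Relaxed);
    println!(
        "{:?} -- Events: {}, Bytes: {}",
        name,
        EVENTS.load(Ordering::Relaxed),
        BYTES.load(Ordering::Relaxed)
    );
}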

I uploaded some old code here; it will not compile, but if you look into achtung_allocation/src/lib.rs it should be relatively clear.

Feel free to open an issue in that repo if you have questions.

frostyplanet commented 4 years ago

@ralfbiedert thanks very much for sharing

iriri commented 3 years ago

I was looking at the SegQueue implementation today and also noticed the somewhat unfortunate allocation behavior, so I threw together a branch where up to one block is cached for reuse (sketched below). This means that no additional block is allocated until the number of pushes exceeds the number of pops by BLOCK_CAP. Retaining more than one block is annoying, as you have to accept some kind of compromise: never freeing a cached block until the entire queue is dropped, epoch collection, or FIFO block reuse. The first two are definitely non-starters, as one of SegQueue's best qualities is its eager freeing.
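
A minimal sketch of the single-block cache idea (Block and all names here are hypothetical stand-ins for crossbeam's internals; dropping a still-cached block when the queue itself is destroyed is omitted):

use std::mem::MaybeUninit;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

const BLOCK_CAP: usize = 32;

// Stand-in for crossbeam's internal block type.
struct Block<T> {
    slots: [MaybeUninit<T>; BLOCK_CAP],
}

impl<T> Block<T> {
    fn new() -> Box<Self> {
        Box::new(Block {
            // An array of MaybeUninit may legitimately start uninitialized.
            slots: unsafe { MaybeUninit::uninit().assume_init() },
        })
    }
}

// Single-slot cache: holds at most one spare block for reuse.
struct BlockCache<T> {
    spare: AtomicPtr<Block<T>>,
}

impl<T> BlockCache<T> {
    const fn new() -> Self {
        BlockCache { spare: AtomicPtr::new(ptr::null_mut()) }
    }

    // When a push needs a fresh block: reuse the spare if there is one.
    fn take_or_alloc(&self) -> Box<Block<T>> {
        let p = self.spare.swap(ptr::null_mut(), Ordering::Acquire);
        if p.is_null() { Block::new() } else { unsafe { Box::from_raw(p) } }
    }

    // When a pop fully drains a block: cache it if the slot is empty,
    // otherwise free it eagerly as before.
    fn put_or_free(&self, block: Box<Block<T>>) {
        let p = Box::into_raw(block);
        if self
            .spare
            .compare_exchange(ptr::null_mut(), p, Ordering::Release, Ordering::Relaxed)
            .is_err()
        {
            drop(unsafe { Box::from_raw(p) });
        }
    }
}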

The benchmarks show some mild improvements but I don't think they're really representative of typical use cases.

Linux x86-64 (16 cores)

master:

test unbounded::create    ... bench:         124 ns/iter (+/- 4)
test unbounded::inout     ... bench:          40 ns/iter (+/- 1)
test unbounded::mpmc      ... bench:     553,855 ns/iter (+/- 17,118)
test unbounded::mpsc      ... bench:   1,162,611 ns/iter (+/- 125,553)
test unbounded::oneshot   ... bench:         195 ns/iter (+/- 5)
test unbounded::par_inout ... bench:   2,224,968 ns/iter (+/- 55,949)
test unbounded::spmc      ... bench:   1,211,853 ns/iter (+/- 65,897)
test unbounded::spsc      ... bench:     956,677 ns/iter (+/- 47,411)

cache-segqueue-block:

test unbounded::create    ... bench:         122 ns/iter (+/- 3)
test unbounded::inout     ... bench:          40 ns/iter (+/- 1)
test unbounded::mpmc      ... bench:     518,356 ns/iter (+/- 23,531)
test unbounded::mpsc      ... bench:   1,130,890 ns/iter (+/- 70,117)
test unbounded::oneshot   ... bench:         207 ns/iter (+/- 5)
test unbounded::par_inout ... bench:   2,174,341 ns/iter (+/- 33,822)
test unbounded::spmc      ... bench:   1,115,574 ns/iter (+/- 65,119)
test unbounded::spsc      ... bench:     784,513 ns/iter (+/- 29,810)

Windows x86-64 (8 cores)

master:

test unbounded::create    ... bench:         105 ns/iter (+/- 1)
test unbounded::inout     ... bench:          47 ns/iter (+/- 0)
test unbounded::mpmc      ... bench:   1,620,820 ns/iter (+/- 116,951)
test unbounded::mpsc      ... bench:   5,490,485 ns/iter (+/- 184,075)
test unbounded::oneshot   ... bench:         215 ns/iter (+/- 4)
test unbounded::par_inout ... bench:   9,814,435 ns/iter (+/- 79,270)
test unbounded::spmc      ... bench:  47,891,120 ns/iter (+/- 57,213,579)
test unbounded::spsc      ... bench:   1,190,482 ns/iter (+/- 362,679)

cache-segqueue-block:

test unbounded::create    ... bench:         104 ns/iter (+/- 2)
test unbounded::inout     ... bench:          51 ns/iter (+/- 1)
test unbounded::mpmc      ... bench:   1,583,942 ns/iter (+/- 53,008)
test unbounded::mpsc      ... bench:   5,526,165 ns/iter (+/- 161,594)
test unbounded::oneshot   ... bench:         218 ns/iter (+/- 3)
test unbounded::par_inout ... bench:   9,757,505 ns/iter (+/- 211,407)
test unbounded::spmc      ... bench:  32,795,000 ns/iter (+/- 37,473,800)
test unbounded::spsc      ... bench:     888,141 ns/iter (+/- 169,239)

Another quirk in SegQueue is the lazy allocation of the first block. This is necessitated by SegQueue::new being a const fn, which honestly doesn't seem very useful right now.
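
To illustrate the constraint (a sketch, not crossbeam's actual code): allocation isn't allowed in const fns, so a const new can only store a null block pointer, and the first use has to install the block with a CAS.

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct LazyFirstBlock<T> {
    block: AtomicPtr<T>,
}

impl<T> LazyFirstBlock<T> {
    const fn new() -> Self {
        // Box::new(...) is not callable here; a null pointer is all we get.
        LazyFirstBlock { block: AtomicPtr::new(ptr::null_mut()) }
    }
}

impl<T: Default> LazyFirstBlock<T> {
    // First use: allocate and install the block, racing with other threads.
    fn ensure(&self) -> *mut T {
        let p = self.block.load(Ordering::Acquire);
        if !p.is_null() {
            return p;
        }
        let fresh = Box::into_raw(Box::new(T::default()));
        match self.block.compare_exchange(
            ptr::null_mut(),
            fresh,
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => fresh,
            Err(existing) => {
                // Another thread won the race; free our allocation.
                drop(unsafe { Box::from_raw(fresh) });
                existing
            }
        }
    }
}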

cuviper commented 3 years ago

FWIW, rayon-rs/rayon#791 switched to crossbeam_deque::Injector, but at the time I saw that the code was almost identical to SegQueue, so I would guess that the same allocation questions arise. (It doesn't have the lazy first block though.) I hope your fixes can be applied to Injector too, because those performance numbers look promising!

iriri commented 3 years ago

> If one were to limit the junkyard to the machine's word size (one slot per bit), you could simply use an atomic bitmask, which might be simple enough to beat allocation.

So I implemented a variant of this on an aptly named branch, where the head and tail of a four-element, intentionally imperfect ring buffer are stuffed into a single AtomicU64. It doesn't seem to do any better on the existing benchmarks than the version that caches only a single block. The SPSC benchmark might be ~5% faster on Windows x86-64, but there's a lot of run-to-run variation so it's hard to be sure.
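
For reference, the index packing might look something like this (a sketch with hypothetical names; the four block pointers themselves would live in a separate slot array):

use std::sync::atomic::{AtomicU64, Ordering};

// Head and tail of a 4-slot ring packed into one atomic word, so both
// can be inspected and advanced with a single compare-exchange.
struct PackedRing {
    head_tail: AtomicU64, // high 32 bits: head, low 32 bits: tail
}

impl PackedRing {
    const fn new() -> Self {
        PackedRing { head_tail: AtomicU64::new(0) }
    }

    fn unpack(v: u64) -> (u32, u32) {
        ((v >> 32) as u32, v as u32)
    }

    fn pack(head: u32, tail: u32) -> u64 {
        ((head as u64) << 32) | tail as u64
    }

    // Reserve a slot for caching a block. "Imperfect": on a lost race we
    // just give up and let the caller free the block instead of retrying.
    fn try_push(&self) -> Option<usize> {
        let cur = self.head_tail.load(Ordering::Relaxed);
        let (head, tail) = Self::unpack(cur);
        if tail.wrapping_sub(head) >= 4 {
            return None; // ring full
        }
        let new = Self::pack(head, tail.wrapping_add(1));
        self.head_tail
            .compare_exchange(cur, new, Ordering::AcqRel, Ordering::Relaxed)
            .ok()
            .map(|_| (tail % 4) as usize)
    }

    // Take a cached block's slot; None if empty or on a lost race.
    fn try_pop(&self) -> Option<usize> {
        let cur = self.head_tail.load(Ordering::Relaxed);
        let (head, tail) = Self::unpack(cur);
        if head == tail {
            return None; // ring empty
        }
        let new = Self::pack(head.wrapping_add(1), tail);
        self.head_tail
            .compare_exchange(cur, new, Ordering::AcqRel, Ordering::Relaxed)
            .ok()
            .map(|_| (head % 4) as usize)
    }
}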

> FWIW, rayon-rs/rayon#791 switched to crossbeam_deque::Injector, but at the time I saw that the code was almost identical to SegQueue, so I would guess that the same allocation questions arise. (It doesn't have the lazy first block though.) I hope your fixes can be applied to Injector too, because those performance numbers look promising!

I haven't looked at deque.rs too closely yet but it's a little strange that it just inlines what appears to be more or less a superset of seg_queue.rs. But yes, the same optimization can be done there too.

I think the main difficulty here is coming up with benchmarks that can actually determine whether this change is worth it. Concurrent queue benchmarks tend to be extremely easy to game, as less concurrency usually means higher throughput.

cuviper commented 3 years ago

The timing variance might be interesting too, if you can measure that -- it will probably be more consistent in the happy path that simply keeps reusing the same allocation.

iriri commented 3 years ago

Decided to revisit this after some time and added a rough latency benchmark. There is some run-to-run variation (a lot, actually, in the case of master) but these results are fairly representative. I was hoping I could abandon the branch where I cache multiple blocks, but unfortunately it seems to do quite a bit better here than caching just one.

Linux x86-64 (8 cores)

master:

spsc | min: 0.000021100, max: 0.000168200, mean: 0.000037867, stddev: 0.000000412
spsc | min: 0.000018300, max: 0.000390400, mean: 0.000033402, stddev: 0.000000869
mpsc | min: 0.000013600, max: 0.007470500, mean: 0.000469169, stddev: 0.000014732
mpsc | min: 0.000096200, max: 0.017162200, mean: 0.001090689, stddev: 0.000031671
spmc | min: 0.000016400, max: 0.000177900, mean: 0.000043798, stddev: 0.000000613
spmc | min: 0.000015800, max: 0.000243400, mean: 0.000043987, stddev: 0.000000597
mpmc | min: 0.000018300, max: 0.010767200, mean: 0.000644508, stddev: 0.000021174
mpmc | min: 0.000014700, max: 0.013832700, mean: 0.000513442, stddev: 0.000022520

Caching one block:

spsc | min: 0.000017100, max: 0.000142800, mean: 0.000026373, stddev: 0.000000295
spsc | min: 0.000017200, max: 0.000122100, mean: 0.000022872, stddev: 0.000000158
mpsc | min: 0.000031600, max: 0.002537700, mean: 0.000549827, stddev: 0.000005980
mpsc | min: 0.000017900, max: 0.018825900, mean: 0.000755237, stddev: 0.000023489
spmc | min: 0.000014900, max: 0.000116100, mean: 0.000023121, stddev: 0.000000142
spmc | min: 0.000015300, max: 0.000115700, mean: 0.000019334, stddev: 0.000000156
mpmc | min: 0.000020900, max: 0.020000700, mean: 0.000364887, stddev: 0.000019470
mpmc | min: 0.000020300, max: 0.008355400, mean: 0.000300422, stddev: 0.000011621

Caching four blocks:

spsc | min: 0.000014500, max: 0.000104800, mean: 0.000017722, stddev: 0.000000110
spsc | min: 0.000016400, max: 0.000136500, mean: 0.000019739, stddev: 0.000000360
mpsc | min: 0.000017500, max: 0.004407900, mean: 0.000139323, stddev: 0.000005644
mpsc | min: 0.000015900, max: 0.002580300, mean: 0.000097392, stddev: 0.000003047
spmc | min: 0.000021000, max: 0.000073600, mean: 0.000023748, stddev: 0.000000101
spmc | min: 0.000014200, max: 0.000393100, mean: 0.000019547, stddev: 0.000000539
mpmc | min: 0.000019400, max: 0.001835600, mean: 0.000142146, stddev: 0.000003811
mpmc | min: 0.000015600, max: 0.008020400, mean: 0.000162701, stddev: 0.000009684

Windows x86-64 (8 cores)

master:

spsc | min: 0.000016300, max: 0.000164700, mean: 0.000022681, stddev: 0.000000472
spsc | min: 0.000013900, max: 0.000142100, mean: 0.000018740, stddev: 0.000000325
mpsc | min: 0.000064100, max: 0.001707300, mean: 0.000293963, stddev: 0.000002431
mpsc | min: 0.000019300, max: 0.002056900, mean: 0.000606370, stddev: 0.000010348
spmc | min: 0.000014700, max: 0.000171900, mean: 0.000024700, stddev: 0.000000506
spmc | min: 0.000016000, max: 0.000197400, mean: 0.000024749, stddev: 0.000000516
mpmc | min: 0.000019300, max: 0.001680300, mean: 0.000331722, stddev: 0.000004057
mpmc | min: 0.000027200, max: 0.000484300, mean: 0.000231683, stddev: 0.000000722

Caching one block:

spsc | min: 0.000017500, max: 0.000274400, mean: 0.000022215, stddev: 0.000000417
spsc | min: 0.000014900, max: 0.000168100, mean: 0.000021970, stddev: 0.000000572
mpsc | min: 0.000020300, max: 0.001446700, mean: 0.000498424, stddev: 0.000005247
mpsc | min: 0.000018300, max: 0.001163600, mean: 0.000451278, stddev: 0.000004430
spmc | min: 0.000014500, max: 0.000221800, mean: 0.000022019, stddev: 0.000000546
spmc | min: 0.000014300, max: 0.000228200, mean: 0.000021656, stddev: 0.000000587
mpmc | min: 0.000018200, max: 0.006011800, mean: 0.000314497, stddev: 0.000006603
mpmc | min: 0.000016000, max: 0.002495500, mean: 0.000218906, stddev: 0.000004683

Caching four blocks:

spsc | min: 0.000015700, max: 0.000243100, mean: 0.000021335, stddev: 0.000000439
spsc | min: 0.000014600, max: 0.000147500, mean: 0.000019120, stddev: 0.000000433
mpsc | min: 0.000218700, max: 0.000928000, mean: 0.000434571, stddev: 0.000003554
mpsc | min: 0.000190500, max: 0.000869100, mean: 0.000253494, stddev: 0.000001212
spmc | min: 0.000014500, max: 0.000212200, mean: 0.000022515, stddev: 0.000000526
spmc | min: 0.000014700, max: 0.000245100, mean: 0.000021777, stddev: 0.000000517
mpmc | min: 0.000019200, max: 0.002800600, mean: 0.000231056, stddev: 0.000002596
mpmc | min: 0.000066100, max: 0.000900200, mean: 0.000288129, stddev: 0.000001246

SUPERCILEX commented 2 years ago

@taiki-e Any updates on this?