TanTanDev / binary_greedy_mesher_demo


Greedy mesher could be optimized by allocating less #4

Open · Architector4 opened this issue 2 months ago

Architector4 commented 2 months ago

The optimized greedy mesher implementation makes multiple heap allocations every time it's run; namely, allocating two Vecs and a few hashmaps.

It would likely be faster to create thread-local static objects and use those instead, so that the same allocations are reused on each invocation instead of being redone every time.
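
A minimal sketch of that idea, assuming one of the scratch buffers is a Vec of u64 of roughly the size discussed below (the names AXIS_COLS and mesh_chunk are illustrative, not the repo's actual identifiers):

use std::cell::RefCell;

thread_local! {
    // one reusable scratch buffer per thread, allocated on first use and
    // kept alive for every later meshing call on that thread.
    static AXIS_COLS: RefCell<Vec<u64>> = RefCell::new(vec![0; 3 * 34 * 34 * 34]);
}

fn mesh_chunk(/* chunk data */) {
    AXIS_COLS.with(|cols| {
        let mut cols = cols.borrow_mut();
        cols.fill(0); // reset the contents; the allocation itself is reused
        // ... build the binary columns into `cols` and run the meshing passes ...
    });
}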

It may be even faster, though more dangerous, to allocate the data directly on the stack as fixed-size arrays. Replacing the two vectors with arrays would take 2,829,888 bytes (~2.7 MiB) of stack space. The default stack size on Linux seems to be 8 MiB, of which this would be a considerable chunk, and it's entirely possible the stack is smaller on other architectures and systems, so I wouldn't recommend it.

leddoo commented 2 months ago

the hashmap allocations could indeed be an issue, though it's not clear; one would have to profile it. as for the two vecs, the cost of heap allocation should be insignificant because clearing that memory takes much longer. but check out #3, because those vecs don't actually need to be that large.

Architector4 commented 2 months ago

clearing that memory takes much longer

Does clearing that memory actually take a significant amount of time? It appears to me that, after allocating the vecs, the code unconditionally writes a value into every single slot in them, which means the clearing could be optimized out by the compiler.

Moreover, the CPU will need to bring that entire allocation into the L2 (or whatever) cache anyway, so if setting every byte to 0 does actually happen, I feel like it would be an insignificant in-cache operation.

leddoo commented 2 months ago

unconditionally writes a value into every single slot in them.

the code uses |=, which relies on the values being zeroed.
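
(a tiny sketch of the pattern in question; the buffer name and indexing are illustrative, not the repo's exact code.)

fn main() {
    // each solid voxel ORs one bit into a per-column mask, so the buffer
    // must start zeroed, or stale bits from a previous run would survive.
    let mut axis_cols = vec![0u64; 34 * 34]; // fresh, zeroed columns
    let (x, y, z) = (5usize, 12u32, 7usize); // some solid voxel (made up)
    axis_cols[z * 34 + x] |= 1 << y;         // set the "solid" bit for this column
    assert_eq!(axis_cols[7 * 34 + 5], 1 << 12);
}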

the clearing could be optimized out by the compiler

this sort of thing is rare, and especially for heap allocations you can't rely on it. llvm doesn't have the aliasing information needed to eliminate a memset on a vec's buffer even in cases where doing so would be correct, even though vec uses Unique internally -- don't ask me what's going wrong there.

Moreso, the CPU will need to bring that entire allocation into an L2 (or whatever) cache anyway

there's an interesting question here: if the cpu can detect that the program is overwriting entire cache lines (with zero in this case), does it still fetch those cache lines from memory? it could in theory just initialize them in the cache. i suspect that the cpus i've tested can't or don't do that, otherwise you shouldn't see a 100 µs reduction just from putting these arrays on the stack. heap allocation is slow, but not that slow.

we can do a back of the envelope calculation: we're going from 2.7 MiB to ~0.1 MiB, so let's attribute the entire 100 µs reduction to memset. that would be 2.6 MiB / 100 µs = 25 GiB/s. the machine with the 100 µs improvement has a memory bandwidth of ~16 GiB/s, which isn't too far off; maybe if we factor in the cost of heap allocation and possibly some compiler optimizations, we could get it to add up.

but judging by the 25 GiB/s figure, i think the cpu is indeed fetching the uninitialized memory, which it then overwrites with zeros. the rest of the computation should be entirely in (L3) cache though.
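
(redoing that arithmetic in plain rust, just to make the units explicit; the numbers are the ones quoted above.)

fn main() {
    let bytes = 2.6 * 1024.0 * 1024.0; // ~2.6 MiB attributed entirely to memset
    let seconds = 100e-6;              // the observed ~100 µs reduction
    let gib_per_s = bytes / seconds / (1024.0 * 1024.0 * 1024.0);
    println!("{gib_per_s:.1} GiB/s");  // prints roughly 25.4 GiB/s
}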

Feilkin commented 2 months ago

Edit: I was looking at the wrong code :\

~~Clearing a Vec is not expensive, unless the element type implements an expensive Drop. All .clear() does is call ptr::drop_in_place() on the elements and set self.len to 0.

There also isn't much to optimize wrt. clearing, except the Drop.

The code is currently allocating two new vectors in the greedy mesher per function call, like Architector4 said. This means you are getting at least two calls to the allocator per function call, which could be optimized, as stated above.

Resizing the allocation together with clear() works functionally the same as your current code, but it saves 2+ allocations. If you don't want to reuse the allocations, at least use Vec::with_capacity instead of vec![] to make sure you only get two allocations.~~

leddoo commented 2 months ago

i saw the edit, so i'll only respond to what's still relevant:

you are getting at least two calls to the allocator per function call

while that's of course correct, what i'm saying is that it doesn't really matter with the 2.8 MB vecs. when running on a performance core, allocating and freeing two vecs of 0.9 MB and 1.8 MB takes around 2.5 µs on my machine, which amounts to around 2% of the meshing function's runtime.

since 2.5 µs is a pretty large number for just two allocations in absolute terms, it would be interesting to investigate whether those calls actually go directly to the OS -- in which case the memory would already be zeroed! (we would of course have to go through mmap directly if we wanted to make use of that fact)
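
(a minimal sketch of what "already zeroed" could look like if we asked for zeroed memory explicitly; on linux, a large alloc_zeroed request is typically serviced with fresh pages from the kernel, which arrive pre-zeroed, so no memset is needed.)

use std::alloc::{alloc_zeroed, dealloc, Layout};

fn main() {
    // ~1.8 MiB: large enough that the allocator will usually hand back
    // fresh OS pages rather than recycling a small-bin allocation.
    let layout = Layout::array::<u64>(6 * 34 * 34 * 34).unwrap();
    unsafe {
        let ptr = alloc_zeroed(layout);
        assert!(!ptr.is_null());
        // contents are guaranteed to be zero; no explicit clearing required.
        dealloc(ptr, layout);
    }
}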

in terms of number of allocations, i'd be more worried about let mut data: [HashMap<u32, HashMap<u32, [u32; 32]>>; 6];, as more block types are added.

use Vec::with_capacity instead of vec![]

fyi, the vec! macro uses with_capacity internally.

leddoo commented 2 months ago

here's something funny: allocating the arrays on the stack also takes approximately 2 µs.

don't believe me? try it!

use core::mem::MaybeUninit;

#[inline(never)]
fn baseline() {
    core::hint::black_box(());
}

#[inline(never)]
fn alloc() {
    let v1 = [<MaybeUninit<u8>>::uninit(); 3*34*34*34*8];
    let v2 = [<MaybeUninit<u8>>::uninit(); 6*34*34*34*8];
    core::hint::black_box((v1, v2));
}

fn main() {
    // repeat to make sure initial page faults don't skew our results.
    for _ in 0..4 {

        let iters = 1_000_000;

        let t0 = std::time::Instant::now();
        for _ in 0..iters {
            baseline();
        }
        println!("baseline {:?}", t0.elapsed()/iters);

        let t0 = std::time::Instant::now();
        for _ in 0..iters {
            alloc();
        }
        println!("alloc {:?}", t0.elapsed()/iters);

    }
}

for stack frames larger than the native page size, the compiler inserts code that touches every page of the stack frame. this is to make sure stack overflows are detected reliably because there is typically only one guard page at the end of the stack.

this is known as stack probing, and in this case, it takes about the same amount of time as allocating the memory from the heap, interestingly enough. here's the asm:

alloc:
        mov     r11, rsp
        sub     r11, 2826240
-- this is the stack probing loop:
.LBB0_1:
        sub     rsp, 4096
        mov     qword ptr [rsp], 0
        cmp     rsp, r11
        jne     .LBB0_1

        sub     rsp, 3520
        lea     rax, [rsp - 128]
        add     rsp, 2829760
        ret

jcdickinson commented 6 days ago

I added pretty aggressive memory pooling everywhere: https://github.com/jcdickinson/binary_greedy_mesher_demo/commit/129f0b2e11469ac319fe147f00e468139a847e12 (pooling branch).

End result? Basically no difference. I uncapped the meshing tasks, and performance is identical between the current head of this repo and my changes (~80-100 ms for a full re-mesh on my 16-core laptop - edit: in a release build). This includes setting up a bump allocator (which does give a ~20% improvement in the criterion benchmarks), doing a very lazy partial fix for the hashmap, and pooling some other pieces of memory.

I broke the cardinal rule of optimization, which is to profile before optimizing :). I definitely optimized the wrong code. The hash set doesn't look like a problem for this repo as it stands.

Though, the memory improvements may well become important in a full game where other things are causing allocation pressure.

A better implementation might be to have all allocations in a bump allocator, and have a fixed set of threads each with its own bump allocator. This is something I plan to look at doing when I get home from vacation.
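
A rough sketch of that shape, assuming the bumpalo crate for the arena (this is not what the pooling branch does, just an illustration; mesh_one_chunk and the buffer names are made up):

use std::cell::RefCell;
use bumpalo::Bump;

thread_local! {
    // one arena per meshing thread; its backing memory is reused across chunks.
    static ARENA: RefCell<Bump> = RefCell::new(Bump::new());
}

fn mesh_one_chunk(/* chunk: &ChunkData */) {
    ARENA.with(|arena| {
        let mut arena = arena.borrow_mut();
        arena.reset(); // reuse the arena's memory for this chunk

        // scratch buffers live in the arena instead of separate heap allocations
        let axis_cols = arena.alloc_slice_fill_copy(3 * 34 * 34 * 34, 0u64);
        let face_masks = arena.alloc_slice_fill_copy(6 * 34 * 34 * 34, 0u64);

        // ... run the meshing passes using axis_cols and face_masks ...
        let _ = (axis_cols, face_masks);
    });
}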

Another aside about cache locality: the two arrays are probably ruining cache locality at some point, since their access patterns are very different across the passes (though they are optimized for the actual meshing). The hashmaps are almost certainly causing havoc with cache locality, and I didn't do much about that.