**Closed** · Adar-Dagan closed this 2 months ago
Allocating `axis_cols` on the stack is likely a bad idea, since it consumes ~2.7 MiB of stack space. The total stack size on Linux on x86_64 is 8 MiB by default, and may easily be 4 MiB or less on other architectures or operating systems. If the mesher is called from a deeper point in the stack, or on such a platform, it will crash.
Edit: Oh oops, I failed to read the first optimization. Still, it seems like it'd be quite a significant stack size anyway lol
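For what it's worth, the ~2.7 MiB figure is consistent with per-voxel `u64` arrays over a 34³ padded chunk, counting 3 axes (`axis_cols`) plus 6 face directions (`col_face_masks`). A back-of-envelope sketch (the constant and the 3+6 breakdown are assumptions inferred from this thread):

```rust
// Assumed from the discussion: 32-voxel chunk + 1-voxel border per side.
const CHUNK_SIZE_P: usize = 34;

fn main() {
    let u64_bytes = std::mem::size_of::<u64>();
    // Original: one u64 per voxel, for 3 axes + 6 face directions.
    let original = CHUNK_SIZE_P.pow(3) * (3 + 6) * u64_bytes;
    // Reduced: one u64 per *column* (34 bits used), same axis/face counts.
    let reduced = CHUNK_SIZE_P.pow(2) * (3 + 6) * u64_bytes;
    println!("original: {} bytes", original); // 2_829_888 (~2.7 MiB)
    println!("reduced:  {} bytes", reduced);  // 83_232 (< 100 KiB)
}
```

The reduced total also matches the "less than 100k combined" figure mentioned below.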
good catch there with the vecs!
some tweaks/improvements:
> `axis_cols` and `col_face_masks` are too large.

i made `col_face_masks` a multi-dimensional array too, which simplifies the code.
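a sketch of what that layout change buys (dimensions and index order are assumptions, not the PR's exact code): the multi-dimensional array occupies the same contiguous memory as the flat one, but the compiler does the index math.

```rust
// Assumed from the discussion: 32-voxel chunk + 1-voxel border per side.
const CHUNK_SIZE_P: usize = 34;

fn main() {
    let (face, z, x, y) = (2usize, 5usize, 7usize, 11u32);

    // Flat layout: index arithmetic by hand at every access.
    let mut flat = vec![0u64; 6 * CHUNK_SIZE_P * CHUNK_SIZE_P];
    flat[(face * CHUNK_SIZE_P + z) * CHUNK_SIZE_P + x] |= 1 << y;

    // Multi-dimensional layout: same contiguous memory, no manual index math.
    let mut md = [[[0u64; CHUNK_SIZE_P]; CHUNK_SIZE_P]; 6];
    md[face][z][x] |= 1 << y;

    assert_eq!(flat[(face * CHUNK_SIZE_P + z) * CHUNK_SIZE_P + x], md[face][z][x]);
}
```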
> - Changed `axis_cols` to be allocated on the stack.

both arrays combined are less than 100k. as this is a leaf function, which isn't expected to be called from a deep call stack, this is completely fine to have on the stack.
> - The biggest penalty in the loops to populate `axis_cols` is the reading from memory of the block

i improved this by special casing the center chunk (reading directly from the voxels array). for the neighboring chunks, i have added 3 loops based on the original code. the neighbors (6,936 voxels) take almost as much time as the center chunk (32,768 voxels), so it would probably be a good idea to special case those too.
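a rough sketch of the center-chunk special case (all type names, the voxel layout, and the axis mapping are hypothetical; the real code will differ): borrow the center chunk's voxel storage once, then the hot loop is pure array math with no per-voxel indirection.

```rust
const CHUNK_SIZE: usize = 32;

// Hypothetical stand-in for the real chunk type.
struct ChunkData {
    voxels: Vec<u16>, // 0 = air, flat z-major layout (assumed)
}

fn fill_center(axis_cols: &mut [[[u64; 34]; 34]; 3], center: &ChunkData) {
    // Borrow the storage once; no Arc/Vec hops inside the loop.
    let voxels = &center.voxels;
    for z in 0..CHUNK_SIZE {
        for y in 0..CHUNK_SIZE {
            for x in 0..CHUNK_SIZE {
                let voxel = voxels[(z * CHUNK_SIZE + y) * CHUNK_SIZE + x];
                if voxel != 0 {
                    // +1 accounts for the 1-voxel border in the padded arrays.
                    axis_cols[0][z + 1][x + 1] |= 1 << (y + 1);
                    axis_cols[1][y + 1][z + 1] |= 1 << (x + 1);
                    axis_cols[2][y + 1][x + 1] |= 1 << (z + 1);
                }
            }
        }
    }
}

fn main() {
    let mut cols = [[[0u64; 34]; 34]; 3];
    let mut chunk = ChunkData { voxels: vec![0; CHUNK_SIZE.pow(3)] };
    chunk.voxels[0] = 1; // one solid voxel at (0, 0, 0)
    fill_center(&mut cols, &chunk);
    assert_eq!(cols[0][1][1], 1 << 1);
}
```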
> - I played with the match statements in `get_block` and got an improvement

that logic can be done entirely branch free:

```rust
let (x_chunk, x) = ((pos.x + 32) / 32, (pos.x + 32) % 32);
let (y_chunk, y) = ((pos.y + 32) / 32, (pos.y + 32) % 32);
let (z_chunk, z) = ((pos.z + 32) / 32, (pos.z + 32) % 32);
```
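for what it's worth, that mapping sends every position in the padded range [-32, 63] to a neighbor index in 0..=2 and a local coordinate in 0..=31 (assuming a 32-voxel chunk, as in the snippet):

```rust
fn main() {
    // Same arithmetic as the branch-free get_block snippet, per axis.
    let split = |p: i32| ((p + 32) / 32, (p + 32) % 32);
    assert_eq!(split(-1), (0, 31)); // border row of the negative neighbor
    assert_eq!(split(0), (1, 0));   // first voxel of the center chunk
    assert_eq!(split(31), (1, 31)); // last voxel of the center chunk
    assert_eq!(split(32), (2, 0));  // border row of the positive neighbor
}
```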
here are some timings for the improvements. "direct reads" is the "special casing the center chunk" thing.

**i7-8750h**

| | samples | avg | % reduction relative to previous |
|---|---|---|---|
| baseline | 234, 208, 174, 234 µs | 212 µs | n/a |
| smaller arrays | 96, 101, 101, 94 µs | 98 µs | 54% |
| direct reads | 80, 79, 67, 81 µs | 77 µs | 21% |

**m2 pro**

| | samples | avg | % reduction relative to previous |
|---|---|---|---|
| baseline | 124, 124, 125, 124 µs | 124 µs | n/a |
| smaller arrays | 54, 54, 54, 54 µs | 54 µs | 56% |
| direct reads | 45, 44, 45, 45 µs | 45 µs | 17% |

the branch free chunk position function made no measurable difference on the m2 pro. i didn't benchmark it on the intel.
update on `get_block`: these are signed integers, so the optimizer is forced to generate poor code for the signed div/rem. we know that the result is non-negative after adding 32, so convert to u32 for the div/rem to have them compiled as bit ops:

```rust
let x = (pos.x + 32) as u32;
let y = (pos.y + 32) as u32;
let z = (pos.z + 32) as u32;
let (x_chunk, x) = ((x / 32) as i32, (x % 32) as i32);
let (y_chunk, y) = ((y / 32) as i32, (y % 32) as i32);
let (z_chunk, z) = ((z / 32) as i32, (z % 32) as i32);
```

this gets it down to 42.5 µs on the m2, so another ~5%.
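a quick sanity check that the unsigned rewrite agrees with the signed version everywhere it can be called, i.e. over the whole padded range where `p + 32` is non-negative:

```rust
fn main() {
    for p in -32i32..64 {
        // Signed version: compiler must emit code that handles negative operands.
        let signed = ((p + 32) / 32, (p + 32) % 32);
        // Unsigned version: div/rem by 32 compile to a shift and a mask.
        let u = (p + 32) as u32;
        let unsigned = ((u / 32) as i32, (u % 32) as i32);
        assert_eq!(signed, unsigned);
    }
}
```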
@leddoo Thanks for the improvements! I added them to the PR.
I agree that keeping the arrays on the stack is fine, as their size is small enough now. But I checked what would happen if we put them in a `Box`, and there was a measurable regression.

On my computer these optimizations brought the mean time from 280 µs down to around 80 µs. The inner chunk, the neighbor chunks, and the actual calculation each take around the same time, and the allocation is negligible.

I think there is still room for improvement in the neighbor chunks: we could ignore the corner chunks, as they are not important to the face calculation of the current chunk, and we could implement chunk caching like for the inner chunk. I am not going to try to add that to this PR as I want to close it. If I have time I'll take a look at it later, unless someone else has already done it.
Some optimizations I found:

- `axis_cols` and `col_face_masks` are too large. For example, for `axis_cols` we need 34x34x34 bits per axis, meaning we need an array of 34x34 `u64`s per axis: the original `axis_cols` was of size `(CHUNK_SIZE_P ^ 3) * 3` when we need `(CHUNK_SIZE_P ^ 2) * 3`. The exact same problem was in `col_face_masks`. This is the most significant optimization I found; it almost completely eliminates the time it takes to initialize the arrays.
- Changed `axis_cols` to be allocated on the stack. This isn't very important in terms of runtime, but it allows us to use subarrays instead of a flat array while still getting a contiguous piece of memory for that array. So it doesn't harm performance, but there is no longer a need for complicated index calculations. If you want the array to be on the heap instead, we could use a `Box` with this array type to still get its benefits. Also, there is overhead in using `Vec`, so if the array has a constant length then a regular array is better.
- The biggest penalty in the loops to populate `axis_cols` is the reading from memory of the block. I think this is because of the two levels of indirection of the chunks. The first is because the chunk is in a `Vec` and an `Arc` (this might actually be two levels, since those are both pointers), and the second is the `Vec` in the chunk that holds the block data. So I changed the loop order so we would read the chunks in the order they are laid out in memory (to get cache hits).
- I played with the match statements in `get_block` and got an improvement.

I didn't go deep into the code, but two other things I noticed that might be worth taking a look at:

- In `ChunksRefs` there is a `Vec` that holds the chunks; maybe it could also be an array if it is always the same size. The same goes for `ChunkData`.
- In `ChunksRefs` each `ChunkData` is held in an `Arc`. First, is a pointer there really necessary? And if you do need a reference-counting pointer, maybe `Rc` would be fine; from what I know, bevy helps you write code that can be executed as if it were sequential.
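To make the `Vec`-to-array suggestion concrete: if `ChunksRefs` always holds the 3x3x3 neighborhood, the container could be a fixed-size array. A minimal sketch with hypothetical type definitions (the real `ChunksRefs`/`ChunkData` live in the crate and differ):

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real chunk type.
struct ChunkData {
    voxels: Vec<u16>,
}

// 3x3x3 neighborhood: the length is known at compile time, so there is
// no heap allocation for the container itself.
struct ChunksRefs {
    chunks: [Arc<ChunkData>; 27],
}

fn main() {
    let empty = Arc::new(ChunkData { voxels: vec![0; 32 * 32 * 32] });
    let refs = ChunksRefs {
        chunks: std::array::from_fn(|_| Arc::clone(&empty)),
    };
    assert_eq!(refs.chunks.len(), 27);
    assert_eq!(refs.chunks[13].voxels.len(), 32 * 32 * 32); // center chunk
}
```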