TanTanDev / binary_greedy_mesher_demo


Some optimizations #3

Closed · Adar-Dagan closed this 2 months ago

Adar-Dagan commented 2 months ago

Some optimizations I found:

  1. `axis_cols` and `col_face_masks` are too large. For `axis_cols` we only need 34x34x34 bits per axis, i.e. a 34x34 array of `u64` per axis, but the original `axis_cols` had (CHUNK_SIZE_P ^ 3) * 3 elements when (CHUNK_SIZE_P ^ 2) * 3 is enough. `col_face_masks` had exactly the same problem. This is the most significant optimization I found: it almost completely eliminates the time spent initializing the arrays.
  2. Changed `axis_cols` to be allocated on the stack. This isn't very important for runtime, but it lets us use nested subarrays instead of a flat array while still getting one contiguous piece of memory, so there is no longer any need for complicated index calculations. If we want the array on the heap instead, we could wrap the same array type in a `Box` and keep those benefits. There is also some overhead to using a `Vec`, so when the length is constant a plain array is better anyway (see the sketch right after this list).
  3. The biggest cost in the loops that populate `axis_cols` is reading the blocks from memory. I think this is due to the levels of indirection in front of the chunk data: the chunk sits behind a `Vec` and an `Arc` (possibly two pointer hops, since both are pointers), and the block data sits behind another `Vec` inside the chunk. So I changed the loop order to read the blocks in the order they are laid out in memory, to get cache hits.
  4. The last one surprised me: I played with the match statements in `get_block` and got an improvement. I think this comes down to how the compiler optimizes that code.
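
A minimal sketch of the layout change from points 1 and 2 (the function name and example indices here are just illustrative; the real code is in `build_chunk_mesh` below):

```rust
const CHUNK_SIZE: usize = 32;
const CHUNK_SIZE_P: usize = CHUNK_SIZE + 2; // one voxel of padding on each side

// illustrative only: shows the old and new scratch-array layouts side by side
fn scratch_layout_sketch() {
    // before: a flat, heap-allocated Vec sized CHUNK_SIZE_P^3 * 3, i.e. ~0.9 MB
    // of u64s that has to be zeroed on every call and indexed manually.
    let _old_axis_cols = vec![0u64; CHUNK_SIZE_P * CHUNK_SIZE_P * CHUNK_SIZE_P * 3];

    // after: one u64 already holds a whole 34-bit column, so a 2D grid of
    // columns per axis is enough (CHUNK_SIZE_P^2 * 3 entries, ~28 KB). the
    // nested array is still one contiguous block of memory, fits on the
    // stack, and needs no index arithmetic.
    let mut axis_cols = [[[0u64; CHUNK_SIZE_P]; CHUNK_SIZE_P]; 3];
    axis_cols[0][3][7] |= 1 << 5; // mark y = 5 solid in the y-axis column at (z=3, x=7)
}
```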

I didn't go deep into the code, but there are two other things I noticed that might be worth taking a look at:

Architector4 commented 2 months ago

Allocating `axis_cols` on the stack is likely a bad idea, since it consumes ~2.7 MiB of stack space for this. Total stack size on Linux on x86_64, by default, is 8 MiB, and may easily be 4 MiB or less on other architectures or operating systems. If the mesher is called from a deeper point in the stack, or on such a platform, it will crash.

Edit: Oh oops, I failed to read the first optimization. Still, it seems like it'd be quite a significant stack size anyway lol
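
For reference, the rough arithmetic behind the ~2.7 MiB above and the "less than 100k" mentioned further down (assuming u64 entries and CHUNK_SIZE_P = 34, as in this PR):

        original flat layout (before optimization 1):
            axis_cols:       34^3 * 3 * 8 B  ≈ 0.94 MB
            col_face_masks:  34^3 * 6 * 8 B  ≈ 1.89 MB   -> ~2.7 MiB total
        per-column bitmasks (after optimization 1):
            axis_cols:       34^2 * 3 * 8 B  ≈ 28 KB
            col_face_masks:  34^2 * 6 * 8 B  ≈ 55 KB     -> ~83 KB total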

leddoo commented 2 months ago

good catch there with the vecs!

leddoo commented 2 months ago

some tweaks/improvements:

  1. axis_cols and col_face_masks are too large.

i made col_face_masks a multi-dimensional array too, which simplifies the code.

  2. Changed axis_cols to be allocated on the stack.

both arrays combined are less than 100k. as this is a leaf function, which isn't expected to be called from a deep call stack, this is completely fine to have on the stack.

  3. The biggest cost in the loops that populate axis_cols is reading the blocks from memory

i improved this by special casing the center chunk (reading directly from the voxels array). for the neighboring chunks, i have added 3 loops based on the original code. the neighbors (6,936 voxels) take almost as much time as the center chunk (32,768 voxels), so it would probably be a good idea to special case those too.

  4. I played with the match statements in get_block and got an improvement

that logic can be done entirely branch free.

        let (x_chunk, x) = ((pos.x+32)/32, (pos.x+32)%32);
        let (y_chunk, y) = ((pos.y+32)/32, (pos.y+32)%32);
        let (z_chunk, z) = ((pos.z+32)/32, (pos.z+32)%32);


here are some timings for the improvements. "direct reads" is the "special casing the center chunk" thing.

    i7-8750h                                avg     %reduction relative to previous
    baseline:       234, 208, 174, 234 µs  212 µs   na
    smaller arrays:  96, 101, 101,  94 µs   98 µs   54%
    direct reads:    80,  79,  67,  81 µs   77 µs   21%

    m2 pro
    baseline:       124, 124, 125, 124 µs  124 µs   na
    smaller arrays:  54,  54,  54,  54 µs   54 µs   56%
    direct reads:    45,  44,  45,  45 µs   45 µs   17%

the branch free chunk position function made no measurable difference on the m2 pro. i didn't benchmark it on the intel.

the modified `build_chunk_mesh` function:

```rust
pub fn build_chunk_mesh(chunks_refs: &ChunksRefs, lod: Lod) -> Option<ChunkMesh> {
    // early exit, if all faces are culled
    if chunks_refs.is_all_voxels_same() {
        return None;
    }
    let mut mesh = ChunkMesh::default();

    // kudos to Adar-Dagan for catching that these were too large.
    // solid binary for each x,y,z axis (3)
    let mut axis_cols = [[[0u64; CHUNK_SIZE_P]; CHUNK_SIZE_P]; 3];
    // the cull mask to perform greedy slicing, based on solids on previous axis_cols
    let mut col_face_masks = [[[0u64; CHUNK_SIZE_P]; CHUNK_SIZE_P]; 6];

    #[inline]
    fn add_voxel_to_axis_cols(
        b: &crate::voxel::BlockData,
        x: usize,
        y: usize,
        z: usize,
        axis_cols: &mut [[[u64; 34]; 34]; 3],
    ) {
        if b.block_type.is_solid() {
            // x,z - y axis
            axis_cols[0][z][x] |= 1u64 << y as u64;
            // z,y - x axis
            axis_cols[1][y][z] |= 1u64 << x as u64;
            // x,y - z axis
            axis_cols[2][y][x] |= 1u64 << z as u64;
        }
    }

    // inner chunk voxels.
    let chunk = &*chunks_refs.chunks[vec3_to_index(IVec3::new(1, 1, 1), 3)];
    assert!(chunk.voxels.len() == CHUNK_SIZE * CHUNK_SIZE * CHUNK_SIZE || chunk.voxels.len() == 1);
    for z in 0..CHUNK_SIZE {
        for y in 0..CHUNK_SIZE {
            for x in 0..CHUNK_SIZE {
                let i = match chunk.voxels.len() {
                    1 => 0,
                    _ => (z * CHUNK_SIZE + y) * CHUNK_SIZE + x,
                };
                add_voxel_to_axis_cols(&chunk.voxels[i], x + 1, y + 1, z + 1, &mut axis_cols)
            }
        }
    }

    // neighbor chunk voxels.
    // note(leddoo): couldn't be bothered to optimize these.
    //  might be worth it though. together, they take
    //  almost as long as the entire "inner chunk" loop.
    for z in [0, CHUNK_SIZE_P - 1] {
        for y in 0..CHUNK_SIZE_P {
            for x in 0..CHUNK_SIZE_P {
                let pos = ivec3(x as i32, y as i32, z as i32) - IVec3::ONE;
                add_voxel_to_axis_cols(chunks_refs.get_block(pos), x, y, z, &mut axis_cols);
            }
        }
    }
    for y in [0, CHUNK_SIZE_P - 1] {
        for z in 0..CHUNK_SIZE_P {
            for x in 0..CHUNK_SIZE_P {
                let pos = ivec3(x as i32, y as i32, z as i32) - IVec3::ONE;
                add_voxel_to_axis_cols(chunks_refs.get_block(pos), x, y, z, &mut axis_cols);
            }
        }
    }
    for x in [0, CHUNK_SIZE_P - 1] {
        for z in 0..CHUNK_SIZE_P {
            for y in 0..CHUNK_SIZE_P {
                let pos = ivec3(x as i32, y as i32, z as i32) - IVec3::ONE;
                add_voxel_to_axis_cols(chunks_refs.get_block(pos), x, y, z, &mut axis_cols);
            }
        }
    }

    // face culling
    for axis in 0..3 {
        for z in 0..CHUNK_SIZE_P {
            for x in 0..CHUNK_SIZE_P {
                // set if current is solid, and next is air
                let col = axis_cols[axis][z][x];
                // sample descending axis, and set true when air meets solid
                col_face_masks[2 * axis + 0][z][x] = col & !(col << 1);
                // sample ascending axis, and set true when air meets solid
                col_face_masks[2 * axis + 1][z][x] = col & !(col >> 1);
            }
        }
    }

    // greedy meshing planes for every axis (6)
    // key(block + ao) -> HashMap<axis_pos, binary_plane>
    // note(leddoo): don't ask me how this isn't a massive bottleneck.
    //  might become an issue in the future, when there are more block types.
    //  consider using a single hashmap with key (axis, block_hash, y).
    let mut data: [HashMap<u32, HashMap<u32, [u32; 32]>>; 6];
    data = [
        HashMap::new(),
        HashMap::new(),
        HashMap::new(),
        HashMap::new(),
        HashMap::new(),
        HashMap::new(),
    ];

    // find faces and build binary planes based on the voxel block+ao etc...
    for axis in 0..6 {
        for z in 0..CHUNK_SIZE {
            for x in 0..CHUNK_SIZE {
                // skip padded by adding 1 (for x padding) and (z+1) for (z padding)
                let mut col = col_face_masks[axis][z + 1][x + 1];

                // removes the right most padding value, because it's invalid
                col >>= 1;
                // removes the left most padding value, because it's invalid
                col &= !(1 << CHUNK_SIZE as u64);

                while col != 0 {
                    let y = col.trailing_zeros();
                    // clear least significant set bit
                    col &= col - 1;

                    // get the voxel position based on axis
                    let voxel_pos = match axis {
                        0 | 1 => ivec3(x as i32, y as i32, z as i32), // down,up
                        2 | 3 => ivec3(y as i32, z as i32, x as i32), // left, right
                        _ => ivec3(x as i32, z as i32, y as i32),     // forward, back
                    };

                    // calculate ambient occlusion
                    let mut ao_index = 0;
                    for (ao_i, ao_offset) in ADJACENT_AO_DIRS.iter().enumerate() {
                        // ambient occlusion is sampled based on axis (ascent or descent)
                        let ao_sample_offset = match axis {
                            0 => ivec3(ao_offset.x, -1, ao_offset.y), // down
                            1 => ivec3(ao_offset.x, 1, ao_offset.y),  // up
                            2 => ivec3(-1, ao_offset.y, ao_offset.x), // left
                            3 => ivec3(1, ao_offset.y, ao_offset.x),  // right
                            4 => ivec3(ao_offset.x, ao_offset.y, -1), // forward
                            _ => ivec3(ao_offset.x, ao_offset.y, 1),  // back
                        };
                        let ao_voxel_pos = voxel_pos + ao_sample_offset;
                        let ao_block = chunks_refs.get_block(ao_voxel_pos);
                        if ao_block.block_type.is_solid() {
                            ao_index |= 1u32 << ao_i;
                        }
                    }

                    let current_voxel = chunks_refs.get_block_no_neighbour(voxel_pos);
                    // let current_voxel = chunks_refs.get_block(voxel_pos);
                    // we can only greedy mesh same block types + same ambient occlusion
                    let block_hash = ao_index | ((current_voxel.block_type as u32) << 9);
                    let data = data[axis]
                        .entry(block_hash)
                        .or_default()
                        .entry(y)
                        .or_default();
                    data[x as usize] |= 1u32 << z as u32;
                }
            }
        }
    }

    let mut vertices = vec![];
    for (axis, block_ao_data) in data.into_iter().enumerate() {
        let facedir = match axis {
            0 => FaceDir::Down,
            1 => FaceDir::Up,
            2 => FaceDir::Left,
            3 => FaceDir::Right,
            4 => FaceDir::Forward,
            _ => FaceDir::Back,
        };
        for (block_ao, axis_plane) in block_ao_data.into_iter() {
            let ao = block_ao & 0b111111111;
            let block_type = block_ao >> 9;
            for (axis_pos, plane) in axis_plane.into_iter() {
                let quads_from_axis = greedy_mesh_binary_plane(plane, lod.size() as u32);
                quads_from_axis.into_iter().for_each(|q| {
                    q.append_vertices(&mut vertices, facedir, axis_pos, &Lod::L32, ao, block_type)
                });
            }
        }
    }
    mesh.vertices.extend(vertices);

    if mesh.vertices.is_empty() {
        None
    } else {
        mesh.indices = generate_indices(mesh.vertices.len());
        Some(mesh)
    }
}
```
leddoo commented 2 months ago

update on `get_block`: these are signed integers, so the optimizer is forced to generate poor code for the signed div/rem. we know the result is non-negative after adding 32, so convert to u32 for the div/rem so they compile down to bit ops.

        let x = (pos.x + 32) as u32;
        let y = (pos.y + 32) as u32;
        let z = (pos.z + 32) as u32;
        let (x_chunk, x) = ((x/32) as i32, (x%32) as i32);
        let (y_chunk, y) = ((y/32) as i32, (y%32) as i32);
        let (z_chunk, z) = ((z/32) as i32, (z%32) as i32);
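
in other words, with unsigned operands a power-of-two div/rem lowers to a plain shift and mask (signed div/rem has to round toward zero for negative inputs, which is what costs the extra instructions):

        x / 32  ==  x >> 5
        x % 32  ==  x & 31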

this gets it down to 42.5 µs on the m2, so another ~5%.

Adar-Dagan commented 2 months ago

@leddoo Thanks for the improvements! I added them to the PR.

I agree that keeping the arrays on the stack is fine now that they are small enough. I did check what happens if we put them in a `Box`, and there was a measurable regression.
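
The boxed variant would be something like this (just a sketch, using the same CHUNK_SIZE_P constant as above; note that `Box::new` may build the arrays on the stack first and then copy them into the allocation, and the allocation itself is paid on every call):

        let mut axis_cols = Box::new([[[0u64; CHUNK_SIZE_P]; CHUNK_SIZE_P]; 3]);
        let mut col_face_masks = Box::new([[[0u64; CHUNK_SIZE_P]; CHUNK_SIZE_P]; 6]);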

Currently, on my computer, these optimizations bring the mean time from 280 µs down to around 80 µs. The inner chunk, the neighbor chunks, and the actual meshing calculation each take about the same time, and the allocation is negligible.

I think there is still room for improvement in the neighbor chunks: we could ignore the corner chunks, since they don't matter for the face calculation of the current chunk, and we could cache the chunk reference the way we now do for the inner chunk. I am not going to add that to this PR, as I want to close it. If I have time I'll take a look at it later, unless someone else has already done it.
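
A rough, untested sketch of the first idea, based on my reading of the function above: only the face-adjacent padding voxels end up in columns that the greedy pass actually reads, so the three neighbor loops could skip the in-plane padding coordinates, and each plane then only touches a single face-adjacent chunk, which could also be fetched once and read directly like the inner chunk. For the z planes, for example:

```rust
// untested sketch: padding voxels whose x or y also lies in the padding only
// land in columns the greedy pass never reads, so they can be skipped. the
// remaining voxels all come from the single face-adjacent neighbor chunk.
for z in [0, CHUNK_SIZE_P - 1] {
    for y in 1..CHUNK_SIZE_P - 1 {
        for x in 1..CHUNK_SIZE_P - 1 {
            let pos = ivec3(x as i32, y as i32, z as i32) - IVec3::ONE;
            add_voxel_to_axis_cols(chunks_refs.get_block(pos), x, y, z, &mut axis_cols);
        }
    }
}
```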