Megvii-BaseDetection / BEVDepth

Official code for BEVDepth.
MIT License

Confused about Support efficient voxel pooling v2 #124

Closed: defei-coder closed this issue 1 year ago

defei-coder commented 1 year ago

Voxel pooling was recently updated to make it more efficient (https://github.com/Megvii-BaseDetection/BEVDepth/commit/b150910029607292b44a92f950d4a2cd1f3d8543). I am confused about one part of the code: why does it use shared memory? As I understand it, const int sample_x = geom_xyz[thread_sample_idx * 3 + 0]; const int sample_y = geom_xyz[thread_sample_idx * 3 + 1]; const int sample_z = geom_xyz[thread_sample_idx * 3 + 2]; already keep the values in registers, so why save them in shared memory again? Is it because of the different scopes? I mean, if we deleted the shared memory, we would not need to use __syncthreads().

HanBao99 commented 1 year ago

The block configuration is 32x4. First, 1 block (128 threads) loads 128 point coordinates into shared memory. Then 1 warp (32 threads) processes 1 point feature of size num_channel per iteration. In this way, the memory access to the feature tensor is coalesced. We use shared memory to speed up access to the point coordinates in the loop after __syncthreads().
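The two-phase pattern described above can be sketched roughly as follows. This is a simplified illustration, not the code from the commit: it ignores the batch dimension, and the names voxel_pooling_sketch, run_voxel_pooling_sketch, kBlockX, kBlockY, and kPerBlock are hypothetical. The host helper assumes a non-empty input and omits CUDA error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical constants mirroring the 32x4 block configuration.
constexpr int kBlockX = 32;                   // lanes of one warp, split the channels
constexpr int kBlockY = 4;                    // warps per block, split the points
constexpr int kPerBlock = kBlockX * kBlockY;  // 128 points staged per block

__global__ void voxel_pooling_sketch(int total_samples, int num_channels,
                                     int num_voxel_x, int num_voxel_y, int num_voxel_z,
                                     const int* geom_xyz,    // [total_samples, 3]
                                     const float* features,  // [total_samples, num_channels]
                                     float* bev_out) {       // [num_voxel_y, num_voxel_x, num_channels]
  const int idx_in_block = threadIdx.y * kBlockX + threadIdx.x;
  const int block_sample_idx = blockIdx.x * kPerBlock;
  const int thread_sample_idx = block_sample_idx + idx_in_block;

  __shared__ int geom_xyz_shared[kPerBlock * 3];

  // Phase 1: thread k of the block stages the coordinates of point k.
  if (thread_sample_idx < total_samples) {
    for (int c = 0; c < 3; ++c)
      geom_xyz_shared[idx_in_block * 3 + c] = geom_xyz[thread_sample_idx * 3 + c];
  }
  // Every thread must reach the barrier, so it sits outside the bounds check.
  __syncthreads();

  // Phase 2: warp threadIdx.y handles points i = y, y + 4, ...; all 32 lanes
  // read the same staged coordinates, then split the channels among themselves.
  for (int i = threadIdx.y; i < kPerBlock && block_sample_idx + i < total_samples;
       i += kBlockY) {
    const int x = geom_xyz_shared[i * 3 + 0];
    const int y = geom_xyz_shared[i * 3 + 1];
    const int z = geom_xyz_shared[i * 3 + 2];
    if (x < 0 || x >= num_voxel_x || y < 0 || y >= num_voxel_y ||
        z < 0 || z >= num_voxel_z) continue;
    const float* src = features + (block_sample_idx + i) * num_channels;
    float* dst = bev_out + (y * num_voxel_x + x) * num_channels;
    for (int j = threadIdx.x; j < num_channels; j += kBlockX)
      atomicAdd(&dst[j], src[j]);  // consecutive lanes touch consecutive addresses
  }
}

// Hypothetical host helper: copies inputs to the GPU, launches the sketch,
// and returns the pooled BEV features. Error checking is omitted for brevity.
std::vector<float> run_voxel_pooling_sketch(const std::vector<int>& geom,
                                            const std::vector<float>& feats,
                                            int num_channels, int nx, int ny, int nz) {
  const int total = static_cast<int>(geom.size() / 3);
  std::vector<float> out(static_cast<size_t>(nx) * ny * num_channels, 0.0f);
  int* d_geom = nullptr;
  float* d_feats = nullptr;
  float* d_out = nullptr;
  cudaMalloc(&d_geom, geom.size() * sizeof(int));
  cudaMalloc(&d_feats, feats.size() * sizeof(float));
  cudaMalloc(&d_out, out.size() * sizeof(float));
  cudaMemcpy(d_geom, geom.data(), geom.size() * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_feats, feats.data(), feats.size() * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemset(d_out, 0, out.size() * sizeof(float));
  const dim3 block(kBlockX, kBlockY);
  const dim3 grid((total + kPerBlock - 1) / kPerBlock);
  voxel_pooling_sketch<<<grid, block>>>(total, num_channels, nx, ny, nz,
                                        d_geom, d_feats, d_out);
  cudaMemcpy(out.data(), d_out, out.size() * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_geom);
  cudaFree(d_feats);
  cudaFree(d_out);
  return out;
}
```

In this sketch, the thread that loads a point's coordinates in phase 1 is generally not the thread that consumes them in phase 2, which is why the coordinates pass through shared memory instead of staying in registers.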

defei-coder commented 1 year ago

@hanbao007 Thanks for replying! I understand that the second loop speeds up memory access to the features, but I do not see why shared memory is needed. My guess is that shared memory is used to save the coordinates because the function has two scopes (the if condition and the for loop). I think the second loop could be moved into the scope of if (thread_sample_idx < total_samples); sample_x, sample_y, and sample_z would already be in registers, so this would be simpler. Maybe there is something I haven't considered.

HanBao99 commented 1 year ago

Yes, the two scopes are different. We need shared memory to exchange data among threads, and we cannot move the second loop into the scope of if (thread_sample_idx < total_samples), because the first 32 threads fetch the first point coordinate in the second loop, whereas in the first loop they fetch the first 32 point coordinates.
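The index mismatch can be made concrete with a small sketch of which block-local point a given thread reads in each phase. The helper names phase1_point and phase2_point are hypothetical, and the 32x4 block shape is taken from the explanation earlier in the thread.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockX = 32;  // threadIdx.x range: the lanes of one warp
constexpr int kBlockY = 4;   // threadIdx.y range: the warps of one block

// Block-local point whose coordinates thread (tx, ty) loads in the first phase.
// Each of the 128 threads loads a distinct point.
__host__ __device__ inline int phase1_point(int tx, int ty) {
  return ty * kBlockX + tx;
}

// Block-local point that thread (tx, ty) processes in iteration `iter` of the
// second loop. The result does not depend on tx: all lanes of a warp share it.
__host__ __device__ inline int phase2_point(int tx, int ty, int iter) {
  (void)tx;
  return ty + iter * kBlockY;
}
```

For example, thread (5, 0) loads point 5 in the first phase but starts on point 0 in the second loop, so the coordinates it needs were loaded by thread (0, 0) and are only reachable through shared memory.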

defei-coder commented 1 year ago

I get it! I mismatched the indices: const int sample_x = geom_xyz[thread_sample_idx * 3 + 0]; // thread_sample_idx is the thread idx in block, but const int sample_x = geom_xyz_shared[i * 3 + 0]; // i is threadIdx.y. Thank you for your answer!