Hey! That sounds amazing. It would be great if you could also add these layers to DeltaCNN via a pull request :)
threadsPerBlock=64, but pixels_per_block=2. Every warp (= 32 threads) processes one pixel, and every thread processes 1 channel at a time. So if you have e.g. 256 channels, 1 thread will upsample 8 channels of 1 pixel. This means that all pixels are processed in parallel, with every warp processing a single pixel. This way, we get a great memory access pattern, since the warp accesses coalesced memory when torch is set to the channels_last memory format. It also means that you cannot access the mask / val_in / val_out using the global thread ID, but the global warp ID (= globalThreadID / 32).
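As a minimal sketch of this indexing scheme (my own illustration, not the actual DeltaCNN kernel; it assumes channels_last tensors, WARP_SIZE = 32, and an illustrative signature):

```cuda
#include <cstdint>

// Launched with e.g. 64 threads per block -> 2 warps = 2 pixels per block.
__global__ void warp_per_pixel_sketch(const float* val_in, float* val_out,
                                      const uint32_t* mask, int num_pixels, int C)
{
    const int global_thread_id = blockIdx.x * blockDim.x + threadIdx.x;
    const int px_idx   = global_thread_id / 32;  // global warp ID -> one pixel per warp
    const int lane_idx = threadIdx.x % 32;       // lane within the warp -> channel offset
    if (px_idx >= num_pixels) return;

    if (mask[px_idx] != 0) {                     // the mask has one entry per pixel
        // Each lane strides over the channels: with 256 channels, one thread handles 8.
        // Adjacent lanes touch adjacent channels -> coalesced access in channels_last.
        for (int c = lane_idx; c < C; c += 32) {
            const float delta = val_in[px_idx * C + c];
            val_out[px_idx * C + c] = delta;     // placeholder write; the real kernel
                                                 // writes to the upscaled output pixels
        }
    }
}
```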
The upsample layer is not processed in tiles, that is correct. Actually, only convolution and pooling are processed in tiles; everything else is processed in order of pixels, because these layers do not depend on neighboring pixel values.
In the very sparse processing mode, only the filters that are actually used are loaded. But I would recommend not implementing this mode in 3D first, because it is very difficult to optimize in a way that is actually faster than dense tiles. This is why I also only implemented this mode for 3x3 convolutions without striding and dilation: it is just really difficult and the benefit is small. I'd recommend trying to nail the performance of the 3D convolution where each tile is either skipped or processed densely. Use Nsight Compute to see how the kernel is performing and what the bottlenecks are. This app can steer you in the right direction if something is slowing the execution down (most likely memory access or too many registers spilling into global memory).
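A conceptual sketch of the "skip or process densely" scheme (my own illustration, not DeltaCNN code; it assumes one thread block per tile, an update mask with one entry per pixel, and, as a simplification, that a tile's pixels are contiguous in memory):

```cuda
#include <cstdint>

__global__ void conv3d_tile_sketch(const uint32_t* mask_in, int tile_pixels /* ... */)
{
    const int tile_start = blockIdx.x * tile_pixels;

    // Cooperative check: is any pixel in this tile's input footprint updated?
    int any_active = 0;
    for (int i = threadIdx.x; i < tile_pixels; i += blockDim.x)
        any_active |= (mask_in[tile_start + i] != 0);
    any_active = __syncthreads_or(any_active);   // block-wide OR

    if (!any_active)
        return;                                  // skip the whole tile

    // ...otherwise load the tile into shared memory and run the usual dense
    //    3D convolution over it, exactly as a non-sparse kernel would...
}
```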
Hi! I'd love to contribute if I manage to figure it out :)
I have some follow-up questions for 1:
I am trying to understand how the mask/val_in/val_out are being accessed with the global ID. For the case of the upsample layer with scale=2, the checkIfAnyActive function is used, and I wrote out the values according to my own understanding: in the code, px_idx is computed as `int px_idx = start_idx + (threadIdx.x % WARP_SIZE);`, and values in mask are accessed using this px_idx. It seems to me that the two warps in the same block have the same px_idx? And the condition is `uint32_t active = px_idx < end_idx ? mask[px_idx] : 0;`. So it seems to me that 2 threads are used for each pixel (i.e. each warp). But 1 thread is for 1 channel, and mask, I believe, has just 1 channel?
Regarding how the mask/val_in/val_out are arranged in GPU memory: if the input is 1x5x5x2, for example, then global ID 0 is accessing (0,0) first channel, global ID 1 is accessing (0,0) second channel (since they will be in channels_last format), global ID 2 is accessing (0,1) first channel, and so on?
Another confusion: the first chunk of code (the if condition with the checkIfAnyActive function) is updating mask_out for inactive pixel locations. The second chunk of code is updating mask_out for active pixel locations, I believe? Why does it have to check mask_in again: `bool active = mask_in[px_idx] != 0;`?
A follow-up question to 3: for example, the input is 1x10x10x5, the output is 1x10x10x6, and maybe the tile size is 5. Only (0,1) is active, so it's in very sparse mode. So if the filter size is 3x3, input values at (0,0), (0,1), (0,2), (1,0), (1,1), (1,2) are loaded, and all filters 3x3x5x6 will be loaded? Or when you say "only the filters that are actually used are loaded in", does it mean that it will still check which channels are active at (0,1), and if, for example, only channels 1 and 5 are active, the filters loaded in will be of shape 3x3x2x6?
Thank you!
That is not entirely correct: all 32 threads of warp=0 use px_idx=0, and all threads of warp=1 use px_idx=1. The only place where this behaviour is different is in lines 3465-3476 and 3490-3496, and that is only a fairly irrelevant performance optimization that you can ignore / simplify without losing much performance.
> In the code, px_idx is computed as `int px_idx = start_idx + (threadIdx.x % WARP_SIZE);`

No, it is actually `int px_idx = start_idx + (threadIdx.x / WARP_SIZE);` (except, of course, for the mask writing mentioned above).
> and values in mask are being accessed using this px_idx.

Correct.
> It seems to me that the two warps in the same block have the same px_idx? And the condition is `uint32_t active = px_idx < end_idx ? mask[px_idx] : 0;`. So it seems to me that 2 threads are used for each pixel (i.e. each warp).
No. 1 warp accesses 1 pixel, and every thread inside the warp accesses 1 (or multiple) channels (using lane_idx, e.g. in line 3509).
> But 1 thread is for 1 channel, and mask, I believe, has just 1 channel?
Yes, mask has only 1 channel.
> Regarding how the mask/val_in/val_out are arranged in GPU memory: if the input is 1x5x5x2, for example, then global ID 0 is accessing (0,0) first channel, global ID 1 is accessing (0,0) second channel (since they will be in channels_last format), global ID 2 is accessing (0,1) first channel, and so on?
Exactly!
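As a small worked example of that layout (my own illustration, assuming channels_last / NHWC and the 1x5x5x2 shape from the question):

```cpp
#include <cstdio>

int main() {
    const int H = 5, W = 5, C = 2;               // input shape 1x5x5x2 (B,H,W,C), B = 1
    auto flat_val = [&](int y, int x, int c) {   // val_in / val_out element index
        return (y * W + x) * C + c;
    };
    auto flat_px = [&](int y, int x) {           // mask index: one entry per pixel
        return y * W + x;
    };
    printf("%d\n", flat_val(0, 0, 0));           // 0 -> global ID 0: pixel (0,0), channel 0
    printf("%d\n", flat_val(0, 0, 1));           // 1 -> global ID 1: pixel (0,0), channel 1
    printf("%d\n", flat_val(0, 1, 0));           // 2 -> global ID 2: pixel (0,1), channel 0
    printf("%d\n", flat_px(0, 1));               // 1 -> mask entry for pixel (0,1)
    return 0;
}
```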
> Another confusion: the first chunk of code (the if condition with the checkIfAnyActive function) is updating mask_out for inactive pixel locations. The second chunk of code is updating mask_out for active pixel locations, I believe? Why does it have to check mask_in again: `bool active = mask_in[px_idx] != 0;`?
I use the checkIfAnyActive function in all kernels to exit as early as possible. But you are right: in this kernel, it leads to a lot of code duplication without any benefit. You can remove this first check and the code will still work and be much cleaner. You should probably know that I did not properly clean up the code once it was working and fast enough; this was only a one-man research project. You might find more unnecessary code in there ;-)
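For reference, this is roughly what a simplified version without the duplicated branch could look like (a sketch under my own assumptions: scale factor 2, channels_last tensors, one warp per input pixel; names and signature are illustrative, not the actual DeltaCNN kernel):

```cuda
#include <cstdint>

__global__ void upsample_nearest2x_delta_sketch(
    const float* __restrict__ val_in,       // [B, H, W, C], channels_last deltas
    float* __restrict__ val_out,            // [B, 2H, 2W, C]
    const uint32_t* __restrict__ mask_in,   // [B, H, W], one entry per pixel
    uint32_t* __restrict__ mask_out,        // [B, 2H, 2W]
    int B, int H, int W, int C)
{
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // one input pixel per warp
    const int lane    = threadIdx.x % 32;                              // one channel per lane
    if (warp_id >= B * H * W) return;

    const int b = warp_id / (H * W);
    const int y = (warp_id / W) % H;
    const int x = warp_id % W;
    const bool active = mask_in[warp_id] != 0;

    // every input pixel maps to a 2x2 block of output pixels
    for (int dy = 0; dy < 2; ++dy) {
        for (int dx = 0; dx < 2; ++dx) {
            const int out_px = (b * 2 * H + 2 * y + dy) * (2 * W) + (2 * x + dx);
            if (lane == 0) mask_out[out_px] = active ? 1u : 0u;
            if (!active) continue;
            // strided channel loop: lane 0 handles c = 0, 32, 64, ...
            for (int c = lane; c < C; c += 32)
                val_out[out_px * C + c] = val_in[warp_id * C + c];
        }
    }
}
```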
> A follow-up question to 3: for example, the input is 1x10x10x5, the output is 1x10x10x6, and maybe the tile size is 5. Only (0,1) is active, so it's in very sparse mode. So if the filter size is 3x3, input values at (0,0), (0,1), (0,2), (1,0), (1,1), (1,2) are loaded, and all filters 3x3x5x6 will be loaded?
If only (0,1) is active, only this pixel will be loaded, but all neighboring pixels will be updated if they are inside the bounds. If the tile size is 5x5, the output will be 3x3. Let's assume that only input (0,0) out of (5x5) is active: then the only output that is updated is (0,0) out of (3x3). This means that we also only need to load 1x1x5x6 weights -> only the weight that produces the output for the bottom-right neighbor of the active input.
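A quick illustration of this example (my own check, not DeltaCNN code), enumerating which output pixels and weight taps a single active input pixel touches for a 3x3 convolution without padding:

```cpp
#include <cstdio>

int main() {
    const int H = 5, W = 5, K = 3;              // 5x5 tile, 3x3 kernel -> 3x3 output
    const int ay = 0, ax = 0;                   // the only active input pixel: (0,0)
    for (int oy = 0; oy <= H - K; ++oy) {
        for (int ox = 0; ox <= W - K; ++ox) {
            // output (oy,ox) reads inputs (oy..oy+2, ox..ox+2)
            const int ky = ay - oy, kx = ax - ox;  // kernel tap that hits the active pixel
            if (ky >= 0 && ky < K && kx >= 0 && kx < K)
                printf("output (%d,%d) uses weight tap (%d,%d)\n", oy, ox, ky, kx);
        }
    }
    // Prints a single line: "output (0,0) uses weight tap (0,0)".
    // That output is centered on input (1,1), the bottom-right neighbor of the active
    // pixel, so only a 1x1 slice of the 3x3 x C_in x C_out weights is needed.
    return 0;
}
```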
> Or when you say "only the filters that are actually used are loaded in", does it mean that it will still check which channels are active at (0,1), and if, for example, only channels 1 and 5 are active, the filters loaded in will be of shape 3x3x2x6?
No, I only check the update mask per pixel, not whether a channel is actually active or not. Checking individual channels is too slow. But again: writing this "very" sparse implementation for 3D convs is crazy difficult, and I would not recommend it unless you have basically unlimited time for performance optimizations ;-)
Hi! I have to say this is really impressive for a one-man project!
I am trying to see if I can run some tests on the CUDA code to get a better understanding before attempting to start on the custom layer. However, to access the CUDA functions from Python, I think I would have to write some Python bindings using ctypes or invoke? Do you have any suggestions on which library I should be looking at?
Thank you!
Hi!
I would recommend writing a CUDA PyTorch extension directly, just as done in DeltaCNN.
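For example, a minimal binding sketch in the spirit of DeltaCNN's own extension (the function name and signature below are made up for illustration; the real DeltaCNN bindings look different):

```cpp
// binding.cpp -- compiled together with your .cu files via torch.utils.cpp_extension
#include <torch/extension.h>

// implemented in a .cu file; launches the CUDA kernel
void upsample_nearest_delta(torch::Tensor val_in, torch::Tensor val_out,
                            torch::Tensor mask_in, torch::Tensor mask_out,
                            int scale);

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("upsample_nearest_delta", &upsample_nearest_delta,
          "Delta upsampling (nearest) forward (CUDA)");
}
```

You can then build and import it from Python with torch.utils.cpp_extension (either a setup.py using CUDAExtension or the just-in-time load() helper), which is the usual way PyTorch CUDA extensions are wired up.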
Hi! Back with more questions again :)
I am attempting to write the bilinear upsampling layer and several 3D layers, so I am trying to understand the CUDA code. Starting with the easier one, I am looking at DCUpsampingNearest2d first, and specifically this kernel function: https://github.com/facebookresearch/DeltaCNN/blob/9c80029d21fc150c9b94dc2de7d6627d6821f109/src/cuda/other_nn_layers.cu#L3451 The configuration of the kernel seems to be numBlocks=(B*W*H)/2 and threadsPerBlock=2, so (B*W*H)/2 pixels of the input will be processed in parallel? Each pixel will be assigned to one thread, so accessing the corresponding mask values or val_in/val_out values for that pixel will need to be indexed according to the global thread index?
I have not fully understood the kernel code yet, but does it mean that the upsample layer is not being processed in tiles? So only the convolution layer is processed in tiles, and the nonlinear layers are processed pixel by pixel?
For the very sparse processing mode (in the case of convolution), all filter weights will have to be loaded just like in the dense processing mode, except that they are only used for computation on an array of active pixels instead of the entire tile, is that right?
THANK YOU!!