Open axeldavy opened 8 months ago
Just to be more explicit, here is one example of implementing the technique on backward_kernel_dcn:
- Load all the w_im and h_im values into a table.
- Compute their minimums and maximums.
- Based on the minimums and maximums, determine whether any of the "if (h_im > -1 && w_im > -1 && h_im < height_in && w_im < width_in)" tests will be false.
- Based on the minimums and maximums, determine whether any of the ifs inside ms_deform_attn_col2im_bilinear will be false.

If any of them is false (for any element of the warp), execute the current code. If none of them is false, execute a different loop that doesn't have any of the checks.
A similar thing can be done for the other functions.
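A rough sketch of the steps above, not the actual DCNv4 kernel: every name here (`K`, `scatter_bilinear`, `backward_sketch`, `run_backward`) is hypothetical, the image is a single float channel, and the block size is assumed to be a multiple of 32 so that the full-mask vote is valid. Instead of explicit minimums and maximums it folds the per-point range test into one boolean, which is equivalent for finite coordinates and also sends NaN coordinates to the checked path.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int K = 4;  // hypothetical: sampling points handled per thread

// Scatter g times the bilinear weights onto the four neighbours of (h, w).
// With CHECKED=false every bounds test below is removed at compile time.
template <bool CHECKED>
__device__ void scatter_bilinear(float* grad, int H, int W, float h, float w, float g) {
    const int h0 = (int)floorf(h), w0 = (int)floorf(w), h1 = h0 + 1, w1 = w0 + 1;
    const float lh = h - h0, lw = w - w0, hh = 1.f - lh, hw = 1.f - lw;
    if (!CHECKED || (h0 >= 0 && w0 >= 0))         atomicAdd(&grad[h0 * W + w0], hh * hw * g);
    if (!CHECKED || (h0 >= 0 && w1 <= W - 1))     atomicAdd(&grad[h0 * W + w1], hh * lw * g);
    if (!CHECKED || (h1 <= H - 1 && w0 >= 0))     atomicAdd(&grad[h1 * W + w0], lh * hw * g);
    if (!CHECKED || (h1 <= H - 1 && w1 <= W - 1)) atomicAdd(&grad[h1 * W + w1], lh * lw * g);
}

__global__ void backward_sketch(float* grad, int H, int W, const float* hs,
                                const float* ws, const float* grad_out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const bool active = i < n;  // no early return: every lane must reach the vote
    float h_tab[K], w_tab[K];   // the "table" of sampling coordinates
    bool interior = true;       // true if all four taps of every point are in bounds
    for (int k = 0; k < K; ++k) {
        h_tab[k] = active ? hs[i * K + k] : 0.f;
        w_tab[k] = active ? ws[i * K + k] : 0.f;
        interior = interior && (!active || (h_tab[k] >= 0.f && w_tab[k] >= 0.f &&
                                            h_tab[k] < H - 1 && w_tab[k] < W - 1));
    }
    const float g = active ? grad_out[i] : 0.f;
    if (__all_sync(0xffffffffu, interior)) {
        // Fast path: the whole warp is interior, so no per-tap checks remain.
        if (active)
            for (int k = 0; k < K; ++k)
                scatter_bilinear<false>(grad, H, W, h_tab[k], w_tab[k], g);
    } else if (active) {
        // Slow path: same structure as the original checked code.
        for (int k = 0; k < K; ++k)
            if (h_tab[k] > -1.f && w_tab[k] > -1.f && h_tab[k] < H && w_tab[k] < W)
                scatter_bilinear<true>(grad, H, W, h_tab[k], w_tab[k], g);
    }
}

// Host helper for trying the sketch: returns the H*W gradient image.
std::vector<float> run_backward(int H, int W, const std::vector<float>& hs,
                                const std::vector<float>& ws,
                                const std::vector<float>& grad_out) {
    const int n = (int)grad_out.size();
    float *d_grad, *d_hs, *d_ws, *d_go;
    cudaMalloc(&d_grad, H * W * sizeof(float));
    cudaMalloc(&d_hs, hs.size() * sizeof(float));
    cudaMalloc(&d_ws, ws.size() * sizeof(float));
    cudaMalloc(&d_go, n * sizeof(float));
    cudaMemset(d_grad, 0, H * W * sizeof(float));
    cudaMemcpy(d_hs, hs.data(), hs.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ws, ws.data(), ws.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_go, grad_out.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    backward_sketch<<<(n + 31) / 32, 32>>>(d_grad, H, W, d_hs, d_ws, d_go, n);
    std::vector<float> out(H * W);
    cudaMemcpy(out.data(), d_grad, H * W * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_grad); cudaFree(d_hs); cudaFree(d_ws); cudaFree(d_go);
    return out;
}
```

The template parameter keeps one source for both loops, so the checked and unchecked versions cannot drift apart; whether the unchecked loop is actually faster would still need to be measured on the real kernel.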
Thanks for your valuable suggestions, we will definitely try this!
Thanks for looking at it, I hope you get a significant performance boost.
Here are some other ideas that could also improve performance, but that need research work (maybe for one of your future publications?).
Dear authors,
Congrats on the work optimizing DCNv4, the changes make a lot of sense. I didn't expect that the previous DCNv3 was limited by memory instructions and too much computation, but it makes sense, especially for float16.
One issue I see with the current code is the memory accesses in ms_deform_attn_im2col_bilinear and ms_deform_attn_col2im_bilinear which are under different conditionals.
Similarly, the call to ms_deform_attn_col2im_bilinear is under a conditional itself.
These conditionals are of course necessary to avoid accessing outside the image.
My previous experience with GPU programming has shown me that it is hard for GPU compilers to optimize this case well and to reduce latency by issuing memory instructions significantly before their use.
Possibly recent CUDA compilers handle that just fine, and my past experience is no longer valid, and if you have seen very good GPU code generated, ignore my comment.
However, if this still applies, a solution that I have found to give a very significant performance boost in practice is the following: at the beginning of your kernel, check whether any of the conditions will be false for your warp. If all the conditions hold, execute a version of the kernel with no condition checks remaining. If any is false, execute the normal kernel.
Most of the time all the conditions will hold, so the kernel without any conditions will execute. This kernel will be much better optimized by the GPU compiler and be much faster. The 'slow' kernel will only execute in a small minority of cases and will not affect performance much.
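To make the idea concrete, here is a minimal sketch of the warp-level check for a forward bilinear sample. It is not the DCNv4 code: `bilinear`, `forward_sketch` and `run_forward` are hypothetical names, the image is a single float channel, and the block size is assumed to be a multiple of 32 so that `__all_sync` with a full mask is valid.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Bilinear sample of img at (h, w). With CHECKED=false all bounds tests are
// compiled out, so the four loads are unconditional and can be issued early.
template <bool CHECKED>
__device__ float bilinear(const float* img, int H, int W, float h, float w) {
    if (CHECKED && !(h > -1.f && w > -1.f && h < H && w < W)) return 0.f;
    const int h0 = (int)floorf(h), w0 = (int)floorf(w), h1 = h0 + 1, w1 = w0 + 1;
    const float lh = h - h0, lw = w - w0, hh = 1.f - lh, hw = 1.f - lw;
    const float v00 = (!CHECKED || (h0 >= 0 && w0 >= 0))         ? img[h0 * W + w0] : 0.f;
    const float v01 = (!CHECKED || (h0 >= 0 && w1 <= W - 1))     ? img[h0 * W + w1] : 0.f;
    const float v10 = (!CHECKED || (h1 <= H - 1 && w0 >= 0))     ? img[h1 * W + w0] : 0.f;
    const float v11 = (!CHECKED || (h1 <= H - 1 && w1 <= W - 1)) ? img[h1 * W + w1] : 0.f;
    return hh * hw * v00 + hh * lw * v01 + lh * hw * v10 + lh * lw * v11;
}

__global__ void forward_sketch(const float* img, int H, int W, const float* hs,
                               const float* ws, float* out, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const bool active = i < n;  // no early return: every lane must reach the vote
    const float h = active ? hs[i] : 0.f;
    const float w = active ? ws[i] : 0.f;
    // True when all four taps are inside the image (inactive lanes never object).
    const bool interior = !active || (h >= 0.f && w >= 0.f && h < H - 1 && w < W - 1);
    if (__all_sync(0xffffffffu, interior)) {
        if (active) out[i] = bilinear<false>(img, H, W, h, w);  // fast path
    } else {
        if (active) out[i] = bilinear<true>(img, H, W, h, w);   // normal path
    }
}

// Host helper for trying the sketch: samples img at each (hs[i], ws[i]).
std::vector<float> run_forward(const std::vector<float>& img, int H, int W,
                               const std::vector<float>& hs,
                               const std::vector<float>& ws) {
    const int n = (int)hs.size();
    float *d_img, *d_hs, *d_ws, *d_out;
    cudaMalloc(&d_img, img.size() * sizeof(float));
    cudaMalloc(&d_hs, n * sizeof(float));
    cudaMalloc(&d_ws, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_img, img.data(), img.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_hs, hs.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_ws, ws.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    forward_sketch<<<(n + 31) / 32, 32>>>(d_img, H, W, d_hs, d_ws, d_out, n);
    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_img); cudaFree(d_hs); cudaFree(d_ws); cudaFree(d_out);
    return out;
}
```

The branch on the vote result is uniform across the warp, so it causes no divergence; a single out-of-range lane only slows down its own warp.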
Sadly I have no time to work on this at all, but I hope you can use this technique to make an even better DCN.