A technique to get even better speed

axeldavy commented 8 months ago

Dear authors,

Congrats on the work optimizing DCNv4; the changes make a lot of sense. I didn't expect the previous DCNv3 to be limited by memory instructions and excess computation, but it makes sense, especially for float16.

One issue I see with the current code is the memory accesses in ms_deform_attn_im2col_bilinear and ms_deform_attn_col2im_bilinear which are under different conditionals.

Similarly, the call to ms_deform_attn_col2im_bilinear is under a conditional itself.

These conditionals are of course necessary to avoid accessing outside the image.

My previous experience with GPU programming has shown me that GPU compilers have a hard time optimizing this case well, that is, reducing latency by issuing memory instructions significantly before the loaded values are used.

Possibly recent CUDA compilers handle this just fine and my past experience is no longer valid; if you have seen very good GPU code generated, ignore my comment.

However, if this still applies, a solution that I have found to give a very significant performance boost in practice is the following: at the beginning of your kernel, check whether any of the conditions will be false for your warp. If all the conditions will be true, execute a version of the kernel with no condition checks remaining. If any is false, execute the normal kernel.

Most of the time all the conditions will hold, so the kernel without any conditions will execute. This kernel will be much better optimized by the GPU compiler and be much faster. The 'slow' kernel will execute in only a small minority of cases and will not affect performance much.
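To make the fast-path/slow-path idea concrete, here is a minimal CPU-side C++ sketch. It is not the DCNv4 code: `Image`, `Point`, `bilinear_checked`, `bilinear_unchecked`, `fully_inside` and `sample_group` are hypothetical names, and the vector of points stands in for the samples handled by one warp. In a real CUDA kernel the "are all samples inside?" decision would presumably be made per warp, for example with a warp vote such as `__all_sync`.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical single-channel, row-major image (a stand-in for one feature map).
struct Image {
    int height, width;
    std::vector<float> data;
    float at(int y, int x) const { return data[y * width + x]; }
};

struct Point { float h, w; };

// Slow path: each of the four taps sits behind its own bounds check.
float bilinear_checked(const Image& im, float h, float w) {
    if (!(h > -1 && w > -1 && h < im.height && w < im.width)) return 0.f;
    int h0 = static_cast<int>(std::floor(h)), w0 = static_cast<int>(std::floor(w));
    int h1 = h0 + 1, w1 = w0 + 1;
    float lh = h - h0, lw = w - w0, hh = 1.f - lh, hw = 1.f - lw;
    float v = 0.f;
    if (h0 >= 0 && w0 >= 0) v += hh * hw * im.at(h0, w0);
    if (h0 >= 0 && w1 <= im.width - 1) v += hh * lw * im.at(h0, w1);
    if (h1 <= im.height - 1 && w0 >= 0) v += lh * hw * im.at(h1, w0);
    if (h1 <= im.height - 1 && w1 <= im.width - 1) v += lh * lw * im.at(h1, w1);
    return v;
}

// True when all four taps of the sample are guaranteed to be in bounds.
bool fully_inside(const Image& im, Point p) {
    return p.h >= 0 && p.w >= 0 && p.h < im.height - 1 && p.w < im.width - 1;
}

// Fast path: no checks at all, so the four loads can be issued back to back.
float bilinear_unchecked(const Image& im, float h, float w) {
    assert(fully_inside(im, Point{h, w}));  // precondition, checked in debug builds only
    int h0 = static_cast<int>(std::floor(h)), w0 = static_cast<int>(std::floor(w));
    float lh = h - h0, lw = w - w0, hh = 1.f - lh, hw = 1.f - lw;
    return hh * hw * im.at(h0, w0) + hh * lw * im.at(h0, w0 + 1) +
           lh * hw * im.at(h0 + 1, w0) + lh * lw * im.at(h0 + 1, w0 + 1);
}

// Decide once for the whole group, then run one of the two loops.
float sample_group(const Image& im, const std::vector<Point>& pts, bool* used_fast = nullptr) {
    bool all_inside = true;
    for (const Point& p : pts) all_inside = all_inside && fully_inside(im, p);
    if (used_fast) *used_fast = all_inside;
    float sum = 0.f;
    if (all_inside) {
        for (const Point& p : pts) sum += bilinear_unchecked(im, p.h, p.w);
    } else {
        for (const Point& p : pts) sum += bilinear_checked(im, p.h, p.w);
    }
    return sum;
}
```

Both paths return the same value whenever the fast path is legal; the only difference is that the fast loop contains no branches around its loads. Whether this actually helps on a given GPU and compiler version would have to be measured.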

Sadly I have no time to work on this at all, but I hope you can use this technique to make an even better DCN.

axeldavy commented 8 months ago

Just to be more explicit, here is one example of implementing the technique on backward_kernel_dcn:

1. Load all the w_im and h_im values into a table.
2. Compute their minimums and maximums.
3. Based on the minimums and maximums, determine whether any of the "if (h_im > -1 && w_im > -1 && h_im < height_in && w_im < width_in)" tests will be false.
4. Based on the minimums and maximums, determine whether any of the ifs inside ms_deform_attn_col2im_bilinear will be false.

If any of them is false (for any elements of the warp), execute the current code. If none of them is false, execute a different loop that doesn't have any of the checks.
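The min/max test described above could look roughly like the following C++ sketch. The names `Bounds`, `compute_bounds`, `outer_test_always_true`, `inner_tests_always_true` and `can_use_unchecked_loop` are hypothetical. The inner conditions are an assumption: the sketch supposes the ifs inside ms_deform_attn_col2im_bilinear have the usual form (`h_low >= 0`, `w_low >= 0`, `h_high <= height - 1`, `w_high <= width - 1`, with `h_low = floor(h_im)` and `h_high = h_low + 1`), which would need to be checked against the actual source.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Extremes of the sampling coordinates held in the table.
struct Bounds { float min_h, max_h, min_w, max_w; };

Bounds compute_bounds(const std::vector<float>& h_im, const std::vector<float>& w_im) {
    assert(!h_im.empty() && h_im.size() == w_im.size());
    auto h = std::minmax_element(h_im.begin(), h_im.end());
    auto w = std::minmax_element(w_im.begin(), w_im.end());
    return Bounds{*h.first, *h.second, *w.first, *w.second};
}

// The outer test quoted in the comment holds for every sample
// if and only if it holds for the extremes.
bool outer_test_always_true(const Bounds& b, int height, int width) {
    return b.min_h > -1 && b.min_w > -1 && b.max_h < height && b.max_w < width;
}

// Assumed inner tests: floor(h) >= 0 is equivalent to h >= 0, and
// floor(h) + 1 <= height - 1 is equivalent to h < height - 1 (same for w).
bool inner_tests_always_true(const Bounds& b, int height, int width) {
    return b.min_h >= 0 && b.min_w >= 0 && b.max_h < height - 1 && b.max_w < width - 1;
}

// Only when both hold is it safe to run the loop with all checks removed.
bool can_use_unchecked_loop(const std::vector<float>& h_im, const std::vector<float>& w_im,
                            int height, int width) {
    Bounds b = compute_bounds(h_im, w_im);
    return outer_test_always_true(b, height, width) &&
           inner_tests_always_true(b, height, width);
}
```

Under these assumptions the inner conditions are stricter than the outer one, so in practice the single inner check would be enough; both are kept here to mirror the two steps in the description.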

A similar thing can be done for the other functions.

YuwenXiong commented 8 months ago

Thanks for your valuable suggestions, we will definitely try this!

axeldavy commented 8 months ago

Thanks for looking at it, I hope you get a significant performance boost.

Here are some other ideas that could also improve performance, but that need research work (maybe for one of your future publications?).

Each sample of the kernel is bilinearly interpolated. One idea that could speed up DCN is inspired by quantization techniques such as int8, where after training the network completely you finetune the model with quantization enforced: after training, add a finetuning phase that uses nearest neighbor instead of bilinear interpolation. That would divide the cache bandwidth requirements by 4 (one read per sample instead of four) and reduce computation. After this finetuning, the inference network would use DCN with nearest neighbor, and thus be faster. Maybe no finetuning is needed; it's something that could be tested.
All DCNs have made the choice to mimic a convolution kernel, which is usually square (3x3, 5x5, etc.). This is arbitrary, and if you think about it DCN could just define a number of samples N. I tested this idea some time ago with DCNv2 and it worked. For example, instead of having a 3x3 kernel, I would use N=4 samples. All samples were initialized at the center of the convolution. This let me reduce the cost of DCN a lot. Sadly I have not been able to make a complete study to identify the potential drawbacks. Maybe giving the samples an initial offset would be better? This would have to be studied. But with the 'square aspect' of DCN removed and an arbitrary number N of positions, there would be more room for customisation in architecture search. Maybe some layers would like N=20 and others N=5, who knows?
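The two ideas above can be sketched together in a few lines of C++. This is only an illustration under stated assumptions, not code from DCNv2 or DCNv4: `Image`, `Offset`, `sample_nearest` and `deform_sample_n` are hypothetical names, out-of-bounds samples are assumed to contribute zero, and the weights are passed in directly rather than predicted by a network.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-channel, row-major image.
struct Image {
    int height, width;
    std::vector<float> data;
    float at(int y, int x) const { return data[y * width + x]; }
};

// Learned displacement of one sample relative to the center position.
struct Offset { float dh, dw; };

// Nearest-neighbor sampling: one memory read per sample instead of the
// four reads that bilinear interpolation needs.
float sample_nearest(const Image& im, float h, float w) {
    long y = std::lround(h), x = std::lround(w);
    if (y < 0 || x < 0 || y >= im.height || x >= im.width) return 0.f;
    return im.at(static_cast<int>(y), static_cast<int>(x));
}

// Deformable sampling with an arbitrary number of samples N = offsets.size(),
// rather than a square KxK grid.
float deform_sample_n(const Image& im, float center_h, float center_w,
                      const std::vector<Offset>& offsets,
                      const std::vector<float>& weights) {
    assert(offsets.size() == weights.size());
    float out = 0.f;
    for (std::size_t i = 0; i < offsets.size(); ++i) {
        out += weights[i] * sample_nearest(im, center_h + offsets[i].dh,
                                           center_w + offsets[i].dw);
    }
    return out;
}
```

With all offsets initialized to zero, every one of the N samples reads the center pixel, which matches the initialization described above; training would then move the samples apart.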

OpenGVLab / DCNv4

A technique to get even better speed #37