This patch provides a new Thread Scheduler for NVIDIA GPUs.
Problem description
With recent NVIDIA drivers (e.g., 550.76), the thread block is set to 32x32 for 2D kernels. This block size appears to be rejected as illegal only with these recent drivers. This patch adds a custom NVIDIA scheduler that avoids the problem. With the patch, the canonical matrix multiplication gains ~300 GFLOPs over the default scheduler on my RTX 3070 GPU.
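To illustrate the kind of decision a scheduler has to make here, below is a minimal sketch of picking a 2D thread block that stays within a thread budget. It is not the code in this patch: the names `largest_pow2_divisor` and `pick_block_2d`, the `max_threads` and `max_dim` parameters, and the budget of 256 used in the usage lines are all assumptions chosen for illustration.

```python
def largest_pow2_divisor(n, cap):
    """Return the largest power of two that divides n and does not exceed cap."""
    size = 1
    while size * 2 <= cap and n % (size * 2) == 0:
        size *= 2
    return size


def pick_block_2d(global_x, global_y, max_threads=1024, max_dim=32):
    """Pick a (block_x, block_y) pair for a 2D kernel (hypothetical helper).

    Each block dimension divides the matching global dimension, and the
    product block_x * block_y never exceeds max_threads.
    """
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    bx = largest_pow2_divisor(global_x, max_dim)
    by = largest_pow2_divisor(global_y, max_dim)
    # Halve one dimension at a time until the block fits the budget.
    # y is halved first when the two are equal, so x stays the wider one.
    while bx * by > max_threads:
        if by >= bx and by > 1:
            by //= 2
        else:
            bx //= 2
    return bx, by


# With a budget of 1024 threads the block stays at 32x32.
print(pick_block_2d(1024, 1024, max_threads=1024))  # (32, 32)
# With an assumed smaller budget the block shrinks to 16x16.
print(pick_block_2d(1024, 1024, max_threads=256))   # (16, 16)
```

The sketch halves one dimension at a time rather than falling back to a fixed size, so the block stays as large as the assumed budget allows.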
Backend/s tested
Mark the backends affected by this PR.
OS tested
Mark the OS where this PR is tested.
Did you check on FPGAs?
If it is applicable, check your changes on FPGAs.
How to test the new patch?