Closed smeso closed 2 months ago
cc @0cc4m
It's true that the shader was written for devices with a warp size of 32 or 64, and it breaks for smaller values. Does it even output correct results with warp size 16, or is the result still wrong?
I don't think I have a way to test this.
When I tested it, it passed all tests (e.g. test-backend-ops). So it seems to work, but I don't know if there is any corner case in which it would return incorrect results. I can add and run more tests if you can think of anything else that is worth trying.
When the device's warp size is less than 16, loadstride_a and loadstride_b can end up as 0. They are calculated as the workgroup size multiplied by LOADVEC* (which can be 1) and divided by 16 with integer division, and the workgroup size is set to the warp/subgroup size.
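To make the arithmetic concrete, here is a minimal host-side sketch of the calculation described above. The helper name `load_stride` and its parameters are hypothetical, chosen for illustration; the real computation lives in the shader and may be written differently.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the description above:
// stride = workgroup_size * LOADVEC / 16, using unsigned integer division.
// This is an illustration, not the shader's actual code.
constexpr std::uint32_t load_stride(std::uint32_t workgroup_size,
                                    std::uint32_t load_vec) {
    return workgroup_size * load_vec / 16u;
}

// Warp sizes the shader was written for give a non-zero stride.
static_assert(load_stride(64u, 1u) == 4u, "warp size 64");
static_assert(load_stride(32u, 1u) == 2u, "warp size 32");
static_assert(load_stride(16u, 1u) == 1u, "warp size 16");

// A warp size of 8 with LOADVEC == 1 truncates to zero.
static_assert(load_stride(8u, 1u) == 0u, "warp size 8 gives stride 0");
```

With a load vector width of 1, any workgroup size below 16 truncates to a stride of 0.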
The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication.
When they are 0, the loop counters never advance, so the loops are infinite. But infinite loops without side effects are UB, and the values of loadstride_* are known at compile time, so the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0.
We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8).
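As a rough sketch of the fix, assuming the minimum is applied when the workgroup size is chosen on the host side: the names `kMinWorkgroupSize`, `clamped_workgroup_size`, and `stride_after_clamp` are hypothetical and only illustrate the idea, they are not identifiers from the actual patch.

```cpp
#include <cstdint>

// Assumed lower bound taken from the description above (illustrative name).
constexpr std::uint32_t kMinWorkgroupSize = 16u;

// Never let the workgroup size drop below 16, even if the device reports
// a smaller warp/subgroup size.
constexpr std::uint32_t clamped_workgroup_size(std::uint32_t subgroup_size) {
    return subgroup_size < kMinWorkgroupSize ? kMinWorkgroupSize : subgroup_size;
}

// Same stride formula as described in the thread, applied after clamping.
constexpr std::uint32_t stride_after_clamp(std::uint32_t subgroup_size,
                                           std::uint32_t load_vec) {
    return clamped_workgroup_size(subgroup_size) * load_vec / 16u;
}

// A subgroup size of 8 is raised to 16, so the stride is at least 1.
static_assert(clamped_workgroup_size(8u) == 16u, "small sizes are raised");
static_assert(clamped_workgroup_size(64u) == 64u, "large sizes are unchanged");
static_assert(stride_after_clamp(8u, 1u) == 1u, "stride is non-zero");
```

Since LOADVEC* is at least 1, a workgroup size of at least 16 keeps the stride at 1 or more, so the load loops always make progress.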