In the multiexp kernel, we calculate the number of elements per group. The division might have a remainder, hence ceil the result, so that we can be sure that there are enough groups to fit all elements.
In the current code, that ceil operation is performed by casting the number of groups to a float, so that it's a floating point division. Then the result is rounded up (ceiled).
The problem is that floating point numbers don't have arbitrary precision. If you calculate 131076643 / 327.0 with an arbitrary precision calculator, then the result is 400846.003058. That one rounded up gives the expected result of 400847.
Though with 32-bit IEEE-754 floating point numbers, 400846.003058 has the same byte representation as 400846 (0x48C3B9C0). This means that a call to ceil, will have the wrong result 400846.
The fix is to use integer based ceiling instead. The usual way (if you know that the number won't overlow) is:
(number + divisor - 1) / divisor
In our case we know that we won't overflow (we'll never have that many elements on the GPU), hence we can use it.
(131076643 + 327 - 1) / 327 = 400847
With this error by one, there isn't enough space for all elements, so elements get unused. This then leads obvuiously to wrong results.
In the multiexp kernel, we calculate the number of elements per group. The division might have a remainder, hence ceil the result, so that we can be sure that there are enough groups to fit all elements.
In the current code, that ceil operation is performed by casting the number of groups to a float, so that it's a floating point division. Then the result is rounded up (ceiled).
The problem is that floating point numbers don't have arbitrary precision. If you calculate 131076643 / 327.0 with an arbitrary precision calculator, then the result is 400846.003058. That one rounded up gives the expected result of 400847.
Though with 32-bit IEEE-754 floating point numbers, 400846.003058 has the same byte representation as 400846 (0x48C3B9C0). This means that a call to
ceil
, will have the wrong result 400846.The fix is to use integer based ceiling instead. The usual way (if you know that the number won't overlow) is:
In our case we know that we won't overflow (we'll never have that many elements on the GPU), hence we can use it.
With this error by one, there isn't enough space for all elements, so elements get unused. This then leads obvuiously to wrong results.