bssrdf closed this issue 2 months ago.
This is just a consequence of the limited numerical precision during matrix multiplication; if you look at the results of IM2COL, they're identical. With the CPU backend the matrix multiplication is done by converting the 16-bit floats to 32 bit and then doing the matrix multiplication at 32-bit precision. With the CUDA backend the matrix multiplication is done at 16-bit precision and the result is then upcast to 32-bit precision. IEEE 754 half-precision floats have 10 mantissa bits, so when the accumulator has a value in the range 512-1024 the absolute precision is only 0.5. If you change the values for a and b to be ones which are exactly representable by floating point numbers (e.g. 4.0 and 2.0) the results should be exactly identical.
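For illustration, here is a minimal sketch (assuming the ggml_fp32_to_fp16/ggml_fp16_to_fp32 helpers declared in ggml.h are available) that shows the 0.5 spacing of half precision in the 512-1024 range compared to the much finer spacing near 0:

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    // 600.2 lies between the representable half-precision values 600.0 and 600.5,
    // so the FP16 round trip snaps it to 600.0 -- an absolute error of about 0.2
    float x = 600.2f;
    printf("%.4f -> %.4f after FP16 round trip\n", x, ggml_fp16_to_fp32(ggml_fp32_to_fp16(x)));

    // values between 0.5 and 1.0 are spaced only 2^-11 (~0.0005) apart,
    // so the same relative precision gives a much smaller absolute error
    float y = 0.6002f;
    printf("%.6f -> %.6f after FP16 round trip\n", y, ggml_fp16_to_fp32(ggml_fp32_to_fp16(y)));
    return 0;
}
```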
For a "realistic" matrix multiplication where the input values are essentially random and centered around 0 the difference in rounding error between FP32 and FP16 should be much smaller.
If you change the values for a and b to be ones which are exactly representable by floating point numbers (e.g. 4.0 and 2.0) the results should be exactly identical.
Or rather, with those values you need fewer significant digits to exactly represent the product, so the summation is less susceptible to rounding error.
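A small sketch of that point (same assumed ggml FP16 helpers as above): every product 4.0*2.0 is exactly 8.0 and every partial sum of those products remains exactly representable in half precision, whereas 0.1 has no exact binary representation and the FP16-accumulated sum drifts away from the FP32 one:

```c
#include <stdio.h>
#include "ggml.h"

int main(void) {
    float sum16_exact = 0.0f, sum32_exact = 0.0f;
    float sum16_drift = 0.0f, sum32_drift = 0.0f;
    for (int i = 0; i < 100; i++) {
        // 4.0*2.0 = 8.0 exactly; all partial sums up to 800 are multiples of 8
        // and therefore exactly representable in FP16
        sum16_exact = ggml_fp16_to_fp32(ggml_fp32_to_fp16(sum16_exact + 4.0f*2.0f));
        sum32_exact += 4.0f*2.0f;
        // 0.1 is not exactly representable, so each FP16 rounding step can add error
        sum16_drift = ggml_fp16_to_fp32(ggml_fp32_to_fp16(sum16_drift + 0.1f*2.0f));
        sum32_drift += 0.1f*2.0f;
    }
    printf("4.0*2.0 summed 100x: fp16 acc = %f, fp32 acc = %f\n", sum16_exact, sum32_exact);
    printf("0.1*2.0 summed 100x: fp16 acc = %f, fp32 acc = %f\n", sum16_drift, sum32_drift);
    return 0;
}
```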
@JohannesGaessler, thank you for your explanation. Now I have a better understanding of mixed precision computation.
Hi, I am getting different conv_2d results using CUDA and CPU backends.
The setting I am using in test-conv2d.cpp is
The first 4 rows of output from the CUDA backend:
(output not shown)
The first 4 rows of output from the CPU backend:
(output not shown)
The CPU result should be correct.
CPU: Ryzen 9 7950X
GPU: RTX 4090
CUDA Toolkit: 12.3
Could anyone verify this?
Thanks.