ggerganov / ggml

Tensor library for machine learning
MIT License

different conv-2d results between CUDA and CPU backends #970

Closed · bssrdf closed this issue 2 months ago

bssrdf commented 2 months ago

Hi, I am getting different conv_2d results using CUDA and CPU backends.

The setting I am using in test-conv2d.cpp is

 int KW = 3, KH = 3, IC = 32, OC = 32;
 int IW = 28, IH = 40, N = 1;
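
For reference, the conv-2d graph in that test is built roughly along these lines. This is only a minimal sketch: the F16 kernel / F32 input types and the stride/padding/dilation of 1 are assumptions inferred from the printed output (28 values per row, smaller sums at the edges), not a copy of test-conv2d.cpp.

 // rough sketch of the graph construction; parameter values are assumptions
 #include "ggml.h"

 struct ggml_tensor * build_conv2d(struct ggml_context * ctx) {
     const int KW = 3, KH = 3, IC = 32, OC = 32;
     const int IW = 28, IH = 40, N = 1;

     // kernel stored as F16, input as F32 (the F16 path is what the CUDA
     // backend multiplies at half precision)
     struct ggml_tensor * a = ggml_new_tensor_4d(ctx, GGML_TYPE_F16, KW, KH, IC, OC); // kernel
     struct ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, IW, IH, IC, N);  // input

     // stride 1, padding 1, dilation 1 -> output keeps the input's spatial size
     return ggml_conv_2d(ctx, a, b, 1, 1, 1, 1, 1, 1);
 }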

The first 4 rows of output from CUDA backend

load_model: ggml tensor size    = 336 bytes
load_model: backend buffer size = 0.16 MB
load_model: using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
main: compute buffer size: 1.37 MB

Performing test:
[480.0, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 480.0, ]
[480.0, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 480.0, ]
[480.0, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 480.0, ]
[480.0, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 721.5, 480.0, ]

The first 4 rows of output from CPU backend

load_model: ggml tensor size    = 336 bytes
load_model: backend buffer size = 0.16 MB
main: compute buffer size: 1.37 MB

Performing test:
[480.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 480.0, ]
[480.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 480.0, ]
[480.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 480.0, ]
[480.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 720.0, 480.0, ]

The CPU result should be correct.

CPU: Ryzen 9 7950X, GPU: RTX 4090, CUDA Toolkit: 12.3

Could anyone verify this?

Thanks.

JohannesGaessler commented 2 months ago

This is just a consequence of the limited numerical precision during the matrix multiplication; if you look at the results of IM2COL they are identical. With the CPU backend the 16-bit floats are first converted to 32 bit and the matrix multiplication is done at 32-bit precision. With the CUDA backend the matrix multiplication is done at 16-bit precision and the result is then upcast to 32 bit. IEEE 754 half-precision floats have 10 mantissa bits, so when the accumulator holds a value in the range 512-1024 the absolute precision is only 0.5. If you change the values for a and b to ones that are exactly representable by floating-point numbers (e.g. 4.0 and 2.0) the results should be exactly identical.
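
To see the effect in isolation, here is a minimal standalone sketch (assuming ggml.h is on the include path, and using a made-up per-product value rather than the test's actual data). It sums the 3*3*32 = 288 products of one interior output element once in FP32 and once with the accumulator rounded back to FP16 after every addition, which is how a pure half-precision multiplication behaves; whether the FP16 result lands above or below the exact sum just depends on how the individual roundings fall.

 // minimal sketch: compare FP32 vs FP16 accumulation of 288 products
 #include "ggml.h"
 #include <cstdio>

 int main() {
     const int n = 3*3*32; // products summed per interior output element
     // hypothetical per-product value, NOT the data used by test-conv2d.cpp;
     // round it to FP16 first, as the real inputs would be
     const float prod = ggml_fp16_to_fp32(ggml_fp32_to_fp16(2.6f));

     float       sum32 = 0.0f;
     ggml_fp16_t sum16 = ggml_fp32_to_fp16(0.0f);

     for (int i = 0; i < n; ++i) {
         sum32 += prod;
         // FP16 path: every partial sum is rounded back to half precision,
         // so above 512 the accumulator can only move in steps of 0.5
         sum16 = ggml_fp32_to_fp16(ggml_fp16_to_fp32(sum16) + prod);
     }

     printf("FP32 accumulator: %.3f\n", sum32);
     printf("FP16 accumulator: %.3f\n", ggml_fp16_to_fp32(sum16));
     return 0;
 }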

For a "realistic" matrix multiplication where the input values are essentially random and centered around 0 the difference in rounding error between FP32 and FP16 should be much smaller.

JohannesGaessler commented 2 months ago

If you change the values for a and b to be ones which are exactly representable by floating point numbers (e.g. 4.0 and 2.0) the results should be exactly identical.

Or rather, with those values you need fewer significant digits to exactly represent the product, so the summation is less susceptible to rounding error.
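
As a worked example (using the 288 interior products of this test): with a = 4.0 and b = 2.0 every product is exactly 8.0, so each partial sum is a multiple of 8. The FP16 spacing only grows to 1.0 in the range 1024-2048 and to 2.0 in 2048-4096, so every multiple of 8 up to 288 * 8 = 2304 is exactly representable and the FP16 accumulation matches the FP32 one exactly.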

bssrdf commented 2 months ago

@JohannesGaessler, thank you for your explanation. Now I have a better understanding of mixed precision computation.