ggerganov / ggml

Different output between CPU and Metal with the im2col operator #931

Open · lovemefan opened this issue 1 month ago

lovemefan commented 1 month ago

While debugging my custom model, I noticed that the im2col operator produces different results on Metal than on the CPU.

My device is an M1 Pro.

The code is at https://github.com/lovemefan/SenseVoice.cpp/blob/1dfa60459a0104a68066ddf7c275f4c0a33972a6/sense-voice/csrc/sense-voice-encoder.cc#L267-L269:

struct ggml_tensor * im2col = ggml_im2col(ctx0, new_a,
               ggml_reshape_4d(ctx0, b, b->ne[0], 1, b->ne[1], b->ne[2] * b->ne[3]),
               1, 0, padding, 0, 1, 0, false, GGML_TYPE_F16);
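
For reference, here is how I read the arguments against the ggml_im2col declaration in ggml.h (the comment annotations are mine):

// struct ggml_tensor * ggml_im2col(
//         struct ggml_context * ctx,
//         struct ggml_tensor  * a,   // convolution kernel
//         struct ggml_tensor  * b,   // data
//         int  s0, int s1,           // stride
//         int  p0, int p1,           // padding
//         int  d0, int d1,           // dilation
//         bool is_2D,
//         enum ggml_type dst_type);
//
// i.e. the call above uses s0=1, s1=0, p0=padding, p1=0, d0=1, d1=0,
// is_2D=false (the 1-D case), and an F16 destination.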

Here is my callback log of the tensor; the im2col node's name is node55.

Logs attached: cpu.log, metal.log

I would appreciate your help with this. Thank you!

JohannesGaessler commented 1 month ago

Do the im2col tests in test-backend-ops pass?
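
For example:

make -j && ./bin/test-backend-ops -o IM2COL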

lovemefan commented 1 month ago

> Do the im2col tests in test-backend-ops pass?

All passed with -DGGML_METAL=OFF.

One failed with -DGGML_METAL=ON:

96% tests passed, 1 tests failed out of 24

Total Test time (real) =  86.43 sec

The following tests FAILED:
         16 - test-conv-transpose-1d (Subprocess aborted)
ggerganov commented 1 month ago

The test-conv-transpose-1d binary fails on Mac because we don't have a Metal implementation yet.

@JohannesGaessler was asking for the result from test-backend-ops and they currently pass:

make -j && ./bin/test-backend-ops -o IM2COL
Backend 1/2 (CPU)
  Skipping CPU backend
Backend 2/2 (Metal)
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
  Backend name: Metal
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): OK
  1344/1344 tests passed
  Backend Metal: OK

ggml_metal_free: deallocating
2/2 backends passed
OK

@lovemefan Try to add a test to test-backend-ops.cpp that fails and work from there.
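
Roughly like this (a sketch; I'm assuming the test_im2col constructor in tests/test-backend-ops.cpp takes its parameters in the order they are printed in the test names, and the shapes below are placeholders to substitute from your model):

// next to the existing IM2COL cases in test-backend-ops.cpp:
test_cases.emplace_back(new test_im2col(
    GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16,  // type_input, type_kernel, dst_type
    {187, 1, 512, 1},                             // ne_input
    { 11, 1, 512, 1},                             // ne_kernel
    1, 0,                                         // s0, s1
    5, 0,                                         // p0, p1 (p0 = your padding)
    1, 0,                                         // d0, d1
    false));                                      // is_2D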

lovemefan commented 1 month ago

The tests still pass, and I can't reproduce my bug by adding an example to test-backend-ops. I spent some time troubleshooting but couldn't find anything.

Testing 2 backends

Backend 1/2 (CPU)
  Backend name: CPU
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): ggml_backend_register: registered backend CPU
ggml_backend_register: registered backend Metal
OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[20,1,2,1],ne_kernel=[11,1,2,1],s0=1,s1=0,p0=5,p1=0,d0=1,d1=0,is_2D=0): OK
  1345/1345 tests passed
  Backend CPU: OK

Backend 2/2 (Metal)
  Skipping
2/2 backends passed
OK

In my own code, however, the output tensor looks like random data:

node_56 = (f32)     IM2COL(encoder.encoders0.0.self_attn.fsmn_block.weight (reshaped){11, 1, 512, 1}, attention_V (transposed) (cont) (reshaped) (cont){187, 1, 512, 1}}) = {11, 187, 512, 1}
                                     [
                                      [
                                       [      0.1077,       0.0867,       0.0815, ...,       0.0939,       0.0617,       0.0004],
                                       [      0.0509,      -0.0198,       0.0252, ...,       0.0589,       0.0110,      -0.0662],
                                       [     -0.0691,      -0.1348,      -0.1385, ...,      -0.1628,      -0.1450,      -0.1191],
                                       ..., 
                                       [      0.0251,      -0.0317,      -0.0558, ...,      -0.1676,      -0.1620,      -0.0997],
                                       [     -0.1455,      -0.0764,      -0.1159, ...,      -0.0249,       0.0146,       0.0687],
                                       [      0.1323,       0.1453,       0.1234, ...,       0.2963,       0.1976,       0.1235],
                                      ],
                                      [
                                       [      0.3266,       0.3900,       0.3218, ...,       0.0291,       0.0054,      -0.0033],
                                       [      0.0183,       0.0379,       0.1066, ...,       0.2731,       0.2905,       0.3599],
                                       [      0.4786,       0.4105,       0.4773, ...,       0.2667,       0.3345,       0.2860],
                                       ..., 
                                       [      0.2771,       0.2001,       0.2363, ...,       0.1060,       0.1247,       0.0417],
                                       [      0.0584,       0.1006,       0.1617, ...,      -0.0197,      -0.0852,      -0.0765],
                                       [     -0.0800,      -0.0915,      -0.1155, ...,      -0.2073,      -0.2165,      -0.1289],
                                      ],
                                      [
                                       [     -0.1064,      -0.1542,      -0.1333, ...,      -0.0538,       0.0158,       0.0972],
                                       [      0.1904,       0.2528,       0.2121, ...,       0.4930,       0.4395,       0.2486],
                                       [     -0.0395,       0.0909,       0.2529, ...,       0.0461,       0.0395,      -0.0260],
                                       ..., 
                                       [      0.0827,       0.0932,       0.1548, ...,       0.1713,       0.0767,       0.1357],
                                       [      0.1237,       0.1689,       0.2138, ...,       0.1194,       0.1599,       0.2138],
                                       [      0.4326,       0.3982,       0.2848, ...,       0.2005,       0.3298,       0.2315],
                                      ],
                                      ..., 
                                      [
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       ..., 
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                      ],
                                      [
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       ..., 
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                      ],
                                      [
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       ..., 
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                      ],
                                     ]
                                     sum = 10.121820

When the dst type is set to GGML_TYPE_F16, NaNs appear:

ggml_debug:                  node_56 = (f16)     IM2COL(encoder.encoders0.0.self_attn.fsmn_block.weight (reshaped){11, 1, 512, 1}, attention_V (transposed) (cont) (reshaped) (cont){187, 1, 512, 1}}) = {11, 187, 512, 1}
                                     [
                                      [
                                       [     -0.0001,       1.4648,      -0.0001, ...,   -1530.0000,       1.5137,   57856.0000],
                                       [      1.5762,   39104.0000,       1.5830, ...,       1.3711,     -40.0000,       0.7363],
                                       [     -0.0138,       1.3281,       0.0001, ...,       0.0350,       1.4951,      -1.8301],
                                       ..., 
                                       [      0.0001,       1.5469,      -6.4375, ...,     -33.4688,       1.5557,       0.5938],
                                       [      0.5054,      -0.0015,       1.5127, ...,       1.4238,      -0.0002,       1.4258],
                                       [     -5.2070,       1.5615,       8.8516, ...,       0.4473,      -1.3477,       0.6865],
                                      ],
                                      [
                                       [     -1.4102,      -2.2832,      -1.4033, ...,      -1.2461,   45632.0000,      -1.5615],
                                       [     -0.0000,      -1.4502,       0.0019, ...,   16544.0000,      -1.5928,   32320.0000],
                                       [     -1.5186,       0.0350,      -1.5273, ...,       1.3389,   -3242.0000,       1.2646],
                                       ..., 
                                       [     -1.4404,      14.8672,      -1.2959, ...,       1.1074,      -0.0106,       1.3867],
                                       [         nan,       1.5068,     -49.8125, ...,       0.0016,      -1.4277,      -8.9766],
                                       [      1.4160,      -0.0000,       1.6074, ...,       1.5723,     -62.8750,       1.4961],
                                      ],
                                      [
                                       [      0.4175,       1.6631,      -0.2144, ...,    -121.3750,       1.4053,      -0.4783],
                                       [      1.5615,       0.0000,       1.5938, ...,       0.9614,       0.0028,      -0.9175],
                                       [         nan,       1.1455,       0.0018, ...,    7648.0000,      -1.0283,      -0.6484],
                                       ..., 
                                       [     -0.1702,      -1.4219,      56.3750, ...,       0.0184,      -1.4717,      -9.1172],
                                       [     -1.5010,   22800.0000,      -1.5039, ...,      -1.5908,     -11.7031,      -1.6396],
                                       [     -0.1598,      -1.4199,      -0.0002, ...,    4010.0000,       1.6562,    1384.0000],
                                      ],
                                      ..., 
                                      [
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       ..., 
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                      ],
                                      [
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       ..., 
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                      ],
                                      [
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       ..., 
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                       [      0.0000,       0.0000,       0.0000, ...,       0.0000,       0.0000,       0.0000],
                                      ],
                                     ]
                                     sum = nan

What can I do next to provide more information?
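
In the meantime, I am thinking of a standalone repro that runs just this im2col on the CPU and Metal backends with identical inputs and compares the sums. A rough sketch (shapes taken from node_56 above; p0=5, the fill pattern, the memory sizing, and the GGML_USE_METAL guard are my assumptions):

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#ifdef GGML_USE_METAL
#include "ggml-metal.h"
#endif
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

// build the node_56 im2col on the given backend and return the sum of dst
static double im2col_sum(ggml_backend_t backend, enum ggml_type dst_type) {
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16*ggml_tensor_overhead() + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // tensor data lives in the backend buffer
    };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * a = ggml_new_tensor_4d(ctx, GGML_TYPE_F16,  11, 1, 512, 1); // kernel
    struct ggml_tensor * b = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 187, 1, 512, 1); // data

    // s0=1, s1=0, p0=5 (assumed), p1=0, d0=1, d1=0, is_2D=false
    struct ggml_tensor * dst = ggml_im2col(ctx, a, b, 1, 0, 5, 0, 1, 0, false, dst_type);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, dst);

    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    // deterministic inputs so both backends see the same data
    const int64_t na = ggml_nelements(a);
    ggml_fp16_t * ha = malloc(na*sizeof(ggml_fp16_t));
    for (int64_t i = 0; i < na; ++i) ha[i] = ggml_fp32_to_fp16(sinf(0.01f*i));
    ggml_backend_tensor_set(a, ha, 0, na*sizeof(ggml_fp16_t));
    free(ha);

    const int64_t nb = ggml_nelements(b);
    float * hb = malloc(nb*sizeof(float));
    for (int64_t i = 0; i < nb; ++i) hb[i] = cosf(0.01f*i);
    ggml_backend_tensor_set(b, hb, 0, nb*sizeof(float));
    free(hb);

    ggml_backend_graph_compute(backend, gf);

    // read back and sum (convert through f32 for the f16 case)
    const int64_t nd = ggml_nelements(dst);
    double sum = 0.0;
    if (dst_type == GGML_TYPE_F32) {
        float * hd = malloc(nd*sizeof(float));
        ggml_backend_tensor_get(dst, hd, 0, nd*sizeof(float));
        for (int64_t i = 0; i < nd; ++i) sum += hd[i];
        free(hd);
    } else {
        ggml_fp16_t * hd = malloc(nd*sizeof(ggml_fp16_t));
        ggml_backend_tensor_get(dst, hd, 0, nd*sizeof(ggml_fp16_t));
        for (int64_t i = 0; i < nd; ++i) sum += ggml_fp16_to_fp32(hd[i]);
        free(hd);
    }

    ggml_gallocr_free(galloc);
    ggml_free(ctx);
    return sum;
}

int main(void) {
    ggml_backend_t cpu = ggml_backend_cpu_init();
    printf("CPU   f32 sum = %f\n", im2col_sum(cpu, GGML_TYPE_F32));
    ggml_backend_free(cpu);
#ifdef GGML_USE_METAL
    ggml_backend_t metal = ggml_backend_metal_init();
    printf("Metal f32 sum = %f\n", im2col_sum(metal, GGML_TYPE_F32));
    ggml_backend_free(metal);
#endif
    return 0;
}

If the two sums diverge, that would be a self-contained test case; switching dst_type to GGML_TYPE_F16 should then show whether the NaNs reproduce outside my model as well.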