Xilinx / mlir-aie

An MLIR-based toolchain for AMD AI Engine-enabled devices.

Matmul examples don't satisfy numerics #1554

Closed: makslevental closed this 2 weeks ago

makslevental commented 3 weeks ago

Removing the guard here https://github.com/Xilinx/mlir-aie/blob/891b4e0ab2d725d248aa57b2cbf72420b736d006/programming_examples/basic/matrix_multiplication/common.h#L288 and running the example shows the output clearly diverging from the reference:

Running Kernel.
53878232: PID(781071): Submitted command (10)
53882470: PID(781071): Waiting for cmd (10)...
Verifying against reference matmul ...

Reference:
 1872.00   1904.00   2080.00   1936.00   1984.00   ...  1952.00   2008.00   2000.00   2016.00  
 1792.00   1832.00   2016.00   1832.00   1896.00   ...  1832.00   1856.00   1912.00   1888.00  
 1736.00   1808.00   1936.00   1912.00   1848.00   ...  1920.00   1832.00   1992.00   1848.00  
 1824.00   1864.00   1984.00   1912.00   1984.00   ...  1920.00   1920.00   2016.00   2040.00  
 1776.00   1840.00   1952.00   1816.00   1848.00   ...  1960.00   1936.00   2024.00   1968.00  
    ...       ...       ...       ...       ...    ...     ...       ...       ...       ...   
 1848.00   1896.00   1904.00   1824.00   1864.00   ...  1848.00   1920.00   1944.00   1944.00  
 1792.00   1776.00   2024.00   1832.00   1920.00   ...  1944.00   1896.00   2000.00   2040.00  
 1848.00   1872.00   1992.00   1848.00   1888.00   ...  1984.00   1920.00   1944.00   1968.00  
 1760.00   1920.00   1960.00   1856.00   1816.00   ...  1840.00   1968.00   1960.00   1928.00  

Output:
 2040.00   2032.00   2160.00   2064.00   2128.00   ...  2112.00   2096.00   2112.00   2128.00  
 1976.00   1976.00   2080.00   1976.00   2008.00   ...  2016.00   1992.00   2064.00   2032.00  
 1912.00   1960.00   2080.00   2008.00   1992.00   ...  2000.00   1984.00   2080.00   1992.00  
 1992.00   2048.00   2096.00   2016.00   2080.00   ...  2048.00   2080.00   2128.00   2128.00  
 1912.00   2008.00   2112.00   1968.00   1968.00   ...  2048.00   2032.00   2160.00   2096.00  
    ...       ...       ...       ...       ...    ...     ...       ...       ...       ...   
 2000.00   2040.00   2080.00   1984.00   2000.00   ...  2000.00   2016.00   2080.00   2048.00  
 1928.00   1888.00   2112.00   1944.00   2008.00   ...  2048.00   2040.00   2064.00   2048.00  
 1968.00   1968.00   2080.00   1968.00   2048.00   ...  2048.00   2048.00   2096.00   2032.00  
 1920.00   2000.00   2112.00   1968.00   1984.00   ...  1992.00   2080.00   2064.00   2048.00  
Verify time: 2.00secs.
Running Kernel.
makslevental commented 3 weeks ago

Note, the issue persists up to at least commit 7635c9ecc32d51fd7810eb3663d379e8c5efd118:

(base) mlevental@mlevental-F7BSC:/tmp/mlir-aie/programming_examples/basic/matrix_multiplication/whole_array$ git rev-parse HEAD
7635c9ecc32d51fd7810eb3663d379e8c5efd118
(base) mlevental@mlevental-F7BSC:/tmp/mlir-aie/programming_examples/basic/matrix_multiplication/whole_array$ PYTHONPATH=/tmp/mlir-aie/build/python PATH=/opt/xilinx/xrt/bin:/tmp/mlir-aie/build/bin:$PATH make run

Reference:
 1896.00   1832.00   1856.00   1920.00   1848.00   ...  1960.00   1952.00   1912.00   1848.00  
 1904.00   1856.00   1856.00   1904.00   1768.00   ...  1960.00   1976.00   1952.00   1816.00  
 1824.00   1744.00   1856.00   1816.00   1672.00   ...  1872.00   1840.00   1896.00   1864.00  
 1840.00   1832.00   1864.00   1800.00   1720.00   ...  1896.00   1960.00   1880.00   1840.00  
 1936.00   1968.00   1800.00   1960.00   1872.00   ...  2040.00   1984.00   1968.00   1840.00  
    ...       ...       ...       ...       ...    ...     ...       ...       ...       ...   
 1800.00   1856.00   1800.00   1832.00   1672.00   ...  1904.00   1888.00   1936.00   1784.00  
 1888.00   1936.00   1824.00   1856.00   1776.00   ...  1880.00   1944.00   1976.00   1888.00  
 1944.00   1904.00   1840.00   1888.00   1832.00   ...  2000.00   2008.00   1952.00   1784.00  
 1880.00   1872.00   1872.00   1840.00   1872.00   ...  2016.00   2008.00   2008.00   1832.00  

Output:
 2040.00   2008.00   1968.00   2048.00   1968.00   ...  2080.00   2080.00   2080.00   1992.00  
 2016.00   2008.00   2032.00   2008.00   1936.00   ...  2064.00   2080.00   2112.00   1936.00  
 1928.00   1928.00   1960.00   1976.00   1816.00   ...  1992.00   2048.00   2016.00   1936.00  
 1968.00   1968.00   1984.00   1968.00   1904.00   ...  2024.00   2064.00   2000.00   1936.00  
 2040.00   2040.00   1920.00   2080.00   1952.00   ...  2096.00   2096.00   2080.00   1976.00  
    ...       ...       ...       ...       ...    ...     ...       ...       ...       ...   
 1888.00   1952.00   1896.00   1960.00   1840.00   ...  1992.00   2032.00   2016.00   1896.00  
 2000.00   2040.00   1976.00   1976.00   1896.00   ...  1992.00   2080.00   2096.00   1984.00  
 2008.00   2016.00   1976.00   2040.00   1952.00   ...  2112.00   2112.00   2048.00   1936.00  
 1984.00   2048.00   2016.00   1984.00   1968.00   ...  2080.00   2080.00   2096.00   1960.00  
Verify time: 2.00secs.
Running Kernel.
andrej commented 3 weeks ago

Thank you for pointing this out. Coincidentally, I noticed some large errors myself recently and started working on a fix in #1551, which should hopefully reduce the error.

The tolerance was a bit high before and let some errors slide by.
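For context, a minimal sketch of the kind of combined absolute/relative tolerance check a verifier might use; the function name and default values here are illustrative, not the actual common.h code:

```cpp
#include <cmath>

// Illustrative element comparison: accept `out` if it is within an
// absolute floor plus a fraction of the reference magnitude. Loosening
// either tolerance lets larger errors slide by unnoticed.
bool nearly_equal(float ref, float out, float abs_tol = 0.5f,
                  float rel_tol = 0.05f) {
  return std::fabs(out - ref) <= abs_tol + rel_tol * std::fabs(ref);
}
```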

However, we do still expect some divergence: in the new host code, we accumulate in float across the entire K dimension. In the AIE code, we accumulate only blocks of 64 elements in float, then add those partial sums together in bf16. That error can add up.
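To make this concrete, here is a minimal standalone sketch, not the actual kernel or host code, that simulates both accumulation schemes for one dot product; `to_bf16` is an illustrative software emulation of bf16 rounding:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <random>
#include <vector>

// Round a float to the nearest bfloat16 value (round-to-nearest-even on
// the top 16 bits of the IEEE-754 encoding), then widen back to float.
static float to_bf16(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits += 0x7FFFu + ((bits >> 16) & 1u); // round to nearest even
  bits &= 0xFFFF0000u;                   // drop the low mantissa bits
  float y;
  std::memcpy(&y, &bits, sizeof(y));
  return y;
}

int main() {
  const int K = 4096, block = 64;
  std::mt19937 rng(0);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);

  // Elementwise products of one output element's dot product, with the
  // inputs pre-rounded to bf16 as they would be on device.
  std::vector<float> prod(K);
  for (auto &p : prod)
    p = to_bf16(dist(rng)) * to_bf16(dist(rng));

  // Host-style reference: accumulate the whole K dimension in float.
  float ref = 0.0f;
  for (float p : prod)
    ref += p;

  // AIE-style: accumulate 64-element blocks in float, then fold the
  // partial sums into a running total that is rounded to bf16 each step.
  float acc = 0.0f;
  for (int b = 0; b < K; b += block) {
    float partial = 0.0f;
    for (int k = b; k < b + block; ++k)
      partial += prod[k];
    acc = to_bf16(acc + to_bf16(partial));
  }

  std::printf("float ref = %.2f, bf16-folded = %.2f, rel err = %.4f\n",
              ref, acc, (acc - ref) / ref);
}
```

With same-sign terms, every bf16 rounding of the running total can lose low-order bits, and those losses compound across the 64 fold steps; accumulating entirely in float avoids this.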

makslevental commented 3 weeks ago

In the AIE code, we accumulate only blocks of 64 elements in float, then add those partial sums together in bf16. That error can add up.

I'm not super familiar with block float (which is what I think you're talking about?) or why it would lead to excessive error accumulation, but I suppose at an input size of 64x64x64 that error would be minimal? I.e., each K reduction equals the block width? Unfortunately, of course, at those dimensions the example doesn't work at all:

Verifying against reference matmul ...
[    0,     0] 0.00 =!= 264.00
[    0,     1] 0.00 =!= 288.00
[    0,     2] 0.00 =!= 231.00
[    0,     3] 0.00 =!= 316.00
[    0,     4] 0.00 =!= 292.00
[    0,     5] 0.00 =!= 253.00
[    0,     6] 0.00 =!= 272.00
[    0,     7] 0.00 =!= 284.00
[    0,     8] 0.00 =!= 260.00
[    0,     9] 0.00 =!= 268.00
[    0,    10] 0.00 =!= 290.00
[    0,    11] 0.00 =!= 306.00
[    0,    12] 0.00 =!= 278.00
[    0,    13] 0.00 =!= 298.00
[    0,    14] 0.00 =!= 231.00
[    0,    15] 0.00 =!= 264.00
[    0,    16] 0.00 =!= 268.00
[    0,    17] 0.00 =!= 272.00
[    0,    18] 0.00 =!= 302.00
[    0,    19] 0.00 =!= 290.00
[    0,    20] 0.00 =!= 312.00
[    0,    21] 0.00 =!= 292.00
[    0,    22] 0.00 =!= 294.00
[    0,    23] 0.00 =!= 217.00
[    0,    24] 0.00 =!= 304.00
[    0,    25] 0.00 =!= 282.00
[    0,    26] 0.00 =!= 276.00
[    0,    27] 0.00 =!= 268.00
[    0,    28] 0.00 =!= 286.00
[    0,    29] 0.00 =!= 191.00
[    0,    30] 0.00 =!= 282.00
[    0,    31] 0.00 =!= 288.00
...and 4064 further errors.

Reference:
  264.00    288.00    231.00    316.00    292.00   ...   272.00    284.00    260.00    268.00  
  237.00    256.00    199.00    276.00    270.00   ...   237.00    258.00    231.00    248.00  
  242.00    272.00    216.00    270.00    280.00   ...   247.00    252.00    253.00    255.00  
  266.00    262.00    195.00    282.00    290.00   ...   251.00    248.00    256.00    256.00  
  205.00    232.00    162.00    214.00    248.00   ...   210.00    204.00    196.00    210.00  
    ...       ...       ...       ...       ...    ...     ...       ...       ...       ...   
  266.00    284.00    222.00    310.00    308.00   ...   260.00    294.00    260.00    274.00  
  260.00    276.00    219.00    284.00    290.00   ...   270.00    276.00    262.00    276.00  
  247.00    286.00    212.00    278.00    282.00   ...   223.00    268.00    240.00    256.00  
  237.00    260.00    191.00    260.00    278.00   ...   234.00    258.00    256.00    274.00  

Output:
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
    ...       ...       ...       ...       ...    ...     ...       ...       ...       ...   
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
    0.00      0.00      0.00      0.00      0.00   ...     0.00      0.00      0.00      0.00  
Verify time: 0.00secs.

Bumping to 128x128x128 produces the same result, and 256x256x256 produces:

 MLIR compilation: ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:-- 0:00:00 0/1 1 Worker
/usr/include/c++/13/bits/stl_vector.h:1125: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = xilinx::AIE::BufferOp*; _Alloc = std::allocator<xilinx::AIE::BufferOp*>; reference = xilinx::AIE::BufferOp*&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
Aborted (core dumped)
make: *** [/tmp/mlir-aie/programming_examples/basic/matrix_multiplication/whole_array/../makefile-common:65: build/final_256x256x256.xclbin] Error 134
(base) mlevental@mlevental-F7BSC:/tmp/mlir-aie/programming_examples/basic/matrix_multiplication/whole_array$ 

(which indicates the objectFifo stateful transform pass is malfunctioning).

andrej commented 3 weeks ago

I'm not super familiar with block float (which is what I think you're talking about?) or why it would lead to excessive error accumulation, but I suppose at an input size of 64x64x64 that error would be minimal?

I'm not talking about block float but about bfloat16 (Google's brain float). At an input size of 64x64x64 there should indeed be minimal error. At a size of e.g. 4096, we would have 4096/64 = 64 accumulations that each get rounded to bfloat16, so 64 opportunities for error to add up.
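As a rough back-of-the-envelope bound, assuming round-to-nearest bf16 (8 significand bits, so unit roundoff $u = 2^{-8}$) and same-sign partial sums, chaining $n = 64$ bf16 accumulations gives a worst-case relative error of about

$$
\frac{|\hat{s} - s|}{|s|} \;\lesssim\; (n - 1)\,u = 63 \cdot 2^{-8} \approx 0.25,
$$

i.e. up to roughly 25% in the worst case; the high-single-digit-percent divergence in the dumps above is well within that.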

Unfortunately, of course, at those dimensions the example doesn't work at all:

This is because the whole_array design uses 4x4 = 16 compute cores, each of which processes a 64x64 tile, so the minimum input size is 256x256x256. Once #1551 is merged, you will be able to adjust the inner tile size (little m, n, k) to smaller values.
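Spelling out the size arithmetic (a sketch; the constant names are illustrative, the values come from the description above):

```cpp
// whole_array: a 4x4 grid of compute cores, each producing a 64x64 tile.
constexpr int core_rows = 4, core_cols = 4, tile = 64;
constexpr int min_M = core_rows * tile; // 4 * 64 = 256
constexpr int min_N = core_cols * tile; // 4 * 64 = 256
// K must likewise be at least 256 here, so 256x256x256 is the smallest
// configuration the whole_array example supports.
```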

(which indicates the objectFifo stateful transform pass is malfunctioning).

That is bug #1547, for which I also have a workaround in #1551 (for one simple case, which at least makes 256x256x256 work).

If you are planning to work on matmul, I suggest taking a closer look at #1551 so as not to duplicate work on errors I have already fixed.