Fixed a small bug where slicing the K-dimension used a chunk size smaller than K.
Made a slight improvement to the numeric check for half-precision GEMM by using the average absolute difference instead of a single difference, as the latter is unstable.