ARM-software / ComputeLibrary

The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
MIT License

Uninitialized Regs used causes unintended behavior #1093

Closed: Nitesh8998 closed this issue 2 months ago

Nitesh8998 commented 4 months ago

Context: I am adapting a piece of assembly code from the sve_hybrid_s8s32_mmla_6x4VL/generic.cpp kernel to run standalone in my application.

Problem Description: While adapting it, I found that certain SVE registers (specifically z19) are read without prior initialization under specific conditions, so whatever stale data they hold feeds into later instructions and the behavior becomes unpredictable. This shows up in certain "height" paths within the kernel, notably:

Height 1 Case: The kernel comments indicate "no accumulate," and within this path, instructions like "trn2 z20.d, z20.d, z19.d" and "trn2 z1.d, z1.d, z19.d" are executed. Here z19 is read without having been initialized, so the destination register receives unintended data.

Contrast with Height 2 Case and Beyond: In other scenarios (e.g., height 2), z19 is properly initialized with instructions like "ld1rqb { z19.b }, p0/Z, [x25]\n" before being used in subsequent operations.

Temporary Solution: Introducing a manual initialization step (e.g., "mov z19.s, #0x0\n") rectifies the issue in the "no accumulate" (height 1) scenario by zeroing out z19 before its use.
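To make it concrete where the contents of z19 end up, here is a minimal Python sketch of TRN1/TRN2 on 64-bit (.d) elements. It is a scalar model for illustration only, not code from the library. It assumes a 128-bit vector length (two .d elements per register) and assumes z20 holds 16 bytes of the single row of A; the helper names trn1_d and trn2_d and the string placeholders are made up.

```python
def trn1_d(zn, zm):
    """Model of TRN1 on .d elements: interleave the even-numbered elements of zn and zm."""
    out = []
    for i in range(0, len(zn), 2):
        out += [zn[i], zm[i]]
    return out


def trn2_d(zn, zm):
    """Model of TRN2 on .d elements: interleave the odd-numbered elements of zn and zm."""
    out = []
    for i in range(0, len(zn), 2):
        out += [zn[i + 1], zm[i + 1]]
    return out


# Assumption: 128-bit vector length, so each register holds two 64-bit elements.
z20 = ["row0[0:8]", "row0[8:16]"]   # assumed: 16 bytes of the only row of A
z19_stale = ["junk_lo", "junk_hi"]  # whatever was left in z19
z19_zeroed = [0, 0]                 # after "mov z19.s, #0x0"

print(trn1_d(z20, z19_stale))   # ['row0[0:8]', 'junk_lo']
print(trn2_d(z20, z19_stale))   # ['row0[8:16]', 'junk_hi']
print(trn1_d(z20, z19_zeroed))  # ['row0[0:8]', 0]
```

In this model, element 0 of each result always carries the real row data, and z19 only ever reaches element 1, whether it was zeroed or not.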

Although I have only identified this issue in the sve_hybrid_s8s32_mmla_6x4VL/generic.cpp kernel, I am concerned that similar patterns of uninitialized register usage might exist in other kernels within the library.

Input from others who have encountered and resolved similar issues, either in this library or in similar contexts, would be very welcome!

DavidMansell commented 4 months ago

Can you elaborate on what unintended behaviour you see?

MMLA generates two rows of output at a time. Therefore, on these paths which compute an odd number of output rows, the last row is a "don't care". MMLA will compute some random values in the odd elements of the accumulators which are then discarded ("uzp1" reads out the even elements with no corresponding "uzp2" to read the odd elements).
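The "don't care" argument can be checked with a small scalar model. The Python sketch below models SMMLA on one 128-bit segment (a 2x8 by 8x2 product of signed 8-bit values accumulated into a 2x2 block of 32-bit values, stored row-major) and UZP1 on .d elements for a 128-bit vector length. The function names, the test data and the two-accumulator setup are assumptions for illustration, not the kernel's actual register allocation, and 32-bit wraparound is not modelled.

```python
def smmla(acc, zn, zm):
    """Scalar model of SMMLA on one 128-bit segment.

    zn:  16 int8 values, read as a 2x8 matrix (two rows of A).
    zm:  16 int8 values, read as two columns of B, 8 values each.
    acc: 4 int32 values, a 2x2 block stored row-major [c00, c01, c10, c11].
    """
    out = list(acc)
    for i in range(2):
        for j in range(2):
            out[2 * i + j] += sum(zn[8 * i + k] * zm[8 * j + k] for k in range(8))
    return out


def uzp1_d(zn, zm):
    """Model of UZP1 on .d elements at 128-bit vector length.

    Each register is 4 int32 values, i.e. two 64-bit elements. Only the
    even-numbered 64-bit element (the first int32 pair) of each source is kept.
    """
    return zn[0:2] + zm[0:2]


def row0_outputs(row0, row1, b_lo, b_hi):
    """Compute one output row; row1 plays the role of whatever z19 contained."""
    a = row0 + row1                      # A operand with z19 data in the second row
    acc_lo = smmla([0] * 4, a, b_lo)     # output columns 0 and 1
    acc_hi = smmla([0] * 4, a, b_hi)     # output columns 2 and 3
    return uzp1_d(acc_lo, acc_hi)        # rows c10/c11 are never read


ROW0 = [1, 2, 3, 4, 5, 6, 7, 8]
B_LO = [1] * 8 + [2] * 8
B_HI = [-1] * 8 + [3] * 8

with_zero = row0_outputs(ROW0, [0] * 8, B_LO, B_HI)
with_junk = row0_outputs(ROW0, [77, -5, 12, 0, -128, 127, 9, 3], B_LO, B_HI)
print(with_zero)  # [36, 72, -36, 108]
print(with_junk)  # [36, 72, -36, 108]
```

Under these assumptions the values that survive uzp1 are the same whether the second row is zero or arbitrary, because only c10 and c11 depend on it.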

Nitesh8998 commented 4 months ago

I see, I understand now. The unintended behaviour was that, when I inspected the input registers of smmla, I saw that z19 had not been initialized, so a random value was being used in the matrix multiplication and written to the destination register. Since those values are "don't care", I suppose they don't affect the kernel result.

Thank you for the clarification.

But what if the kernel does not need to compute an odd number of output rows?

Also, how can one find out the layout of the input matrices passed to the kernel (input_ptrB and input_ptrA in the kernel code)? That would make it clearer why the don't care rows are needed. For instance, I understand that A and B must follow a certain layout when passed to these kernels in order to get a correct matrix multiplication result.

Thanks.

morgolock commented 2 months ago

Hi @Nitesh8998

For further details about how SVE works, please refer to the SVE Optimization Guide and ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE).