BhaskarsarmaP closed this issue 5 months ago
@BhaskarsarmaP It looks like there is a hidden, undocumented assumption in the code. I won't switch the implementation to double precision because it would hurt performance, and in most cases it is not needed.
I'll update the documentation of the function to say that if the floats are expected to be very big, it may be better to adjust the float array (for example, by clamping it) before calling this function.
For the conversion from float to q31 format, the floating-point number is scaled by a factor of 2^31. The scaled result is then cast to q63_t (a signed 64-bit integer).
(q63_t) (*pIn++ * 2147483648.0f)
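Inputs whose magnitude is around 2^32 (about 4.3 × 10^9) or higher, such as values in the range of 10^10, overflow when scaled by 2^31: the product reaches 2^63 and no longer fits in a signed 64-bit integer. In C, converting an out-of-range float to an integer type is undefined behavior, so the outputs are incorrect.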
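
One way to avoid the overflow without going to double precision is to saturate the value while it is still a float, before the scale and cast. The sketch below is only an illustration of that idea. The names `float_to_q31_sat` and `float_array_to_q31_sat` are hypothetical and are not part of the library, and mapping NaN to 0 is an arbitrary choice made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a library function): convert one float to q31,
 * saturating in floating point so the integer cast is always in range. */
static int32_t float_to_q31_sat(float x)
{
    if (x != x) {
        return 0;              /* NaN: arbitrary choice for this sketch */
    }
    if (x >= 1.0f) {
        return INT32_MAX;      /* saturate to the largest q31 value */
    }
    if (x <= -1.0f) {
        return INT32_MIN;      /* saturate to the smallest q31 value */
    }
    /* Here -1 < x < 1, so x * 2^31 lies strictly inside the int32_t range
     * and the cast through a 64-bit integer cannot overflow. */
    return (int32_t)(int64_t)(x * 2147483648.0f);
}

/* Hypothetical array wrapper mirroring the src/dst/blockSize call shape. */
static void float_array_to_q31_sat(const float *src, int32_t *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = float_to_q31_sat(src[i]);
    }
}
```

With this, an input such as 1.0e10f saturates to INT32_MAX instead of going through an out-of-range cast. The cost is two extra comparisons per sample.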