MikeLankamp / fpm

C++ header-only fixed-point math library
https://mikelankamp.github.io/fpm
MIT License
672 stars 85 forks source link

Conversion of max() and lowest() to float #33

Closed ahooper closed 2 years ago

ahooper commented 2 years ago

fpm is very nice! Converting numeric_limits values to float is behaving unexpectedly when there are zero integral bits.

#include <cstdint>
#include <fpm/fixed.hpp>
#include <fpm/ios.hpp>
#include <iostream>
#include <iomanip>

using fixed_0_15 = fpm::fixed<std::int16_t, std::int32_t, 15>;

int main() {
    auto max = std::numeric_limits<fixed_0_15>::max();
    auto lowest = std::numeric_limits<fixed_0_15>::lowest();
    std::cout << "fixed_0_15"
            << " max " << max
            <<" lowest "<< lowest
            << std::endl;
    std::cout << "static_cast<float>(fixed_0_15)"
              << " max " << static_cast<float>(max)
              <<" lowest "<< static_cast<float>(lowest)
              << std::endl;
    std::cout << "float(fixed_0_15)"
              << " max " << float(max)
              <<" lowest "<< float(lowest)
              << std::endl;
}
$ c++ -std=c++17 -Ifpm/include fpm_limits.cpp && ./a.out
fixed_0_15 max 0.999969 lowest -1
static_cast<float>(fixed_0_15) max -0.999969 lowest 1
float(fixed_0_15) max -0.999969 lowest 1

The fixed_0_15 is an attempt to model the q15 data type used in the ARM CMSIS DSP library. The trouble may be because FRACTION_MULT cannot be represented in the base type int16_t. I think FRACTION_MULT should use the IntermediateType. The tests pass with this change:

-    static constexpr BaseType FRACTION_MULT = BaseType(1) << FractionBits;
+    static constexpr IntermediateType FRACTION_MULT = BaseType(1) << FractionBits;
Cerclique commented 2 years ago

Having the exact same problem here !

_I'm not totally sure @TxtpGame's solution is the right one since it seems to disable the possibility to model fixed_015 data type. Edit: Just saw the pull request ... ignore my italic comment

@ahooper 's correction works like a charm for me (as temporary workaround).

MikeLankamp commented 2 years ago

I'm not familiar with the ARM CMSIS DSP library, but after looking it up, are you sure that the Q15 type has 0 integral bits?

This documentation has q15 as 16 bits wide. This include file (from the library?) documents q15_t as 1.15 format.

Furthermore, 0 integral bits is not really possible, since at least the sign bit must be present. So a fixed-point number with 1 integral bit has a range of [-1, 0.999...]. In theory it could do without so the range is [0, 0.999...] but this Q15 type doesn't seem to be that?

Cerclique commented 2 years ago

Unlike ahooper, I'm not using the ARM CMSIS DSP library. I did not put that much context, my bad.

In my case, I'm am dealing with specific data sent by a sensor. The documentation is explicitly telling me that the data is a 16 bits signed-fixed integer with 15 LSB fractionnal bits. and that, since it a signed-fixed value, the MSB denotes the sign.

I assumed the value has a range of [-1 to 0.999...] since I have no integral bits except the sign bit. So I need a data type like fpm::fixed<std::int16_t, std::int32_t, 15> and that casting to float give the wrong value for min()/max() functions like ahooper explained in the first post.

ahooper commented 2 years ago

I have not thought of the sign bit in two's complement as an integral bit, but it doe not change the problem if you do. In ARM CMSIS DSP the conversion functions between q15_t and float multiply and divide by 32768. So that value would represent 1 in q15_t except that it is beyond the range that can be represented in the 15 bits after the sign bit. The actual maximum value is 0x7fff/32768. = 0.999969482421875..., the same as in my fpm fixed_0_15 example.

arm_float_to_q15.c arm_q15_to_float.c (messy with SIMD alternatives, but there are processor-independent versions at the end)

MikeLankamp commented 2 years ago

Thanks for the clarification, and sorry for the delay on my part for looking at this.

Yes, indeed there's a problem with max() and lowest() when there's a single integral bit. PR #35 does fix it correctly, but is lacking the requested tests. @TxtpGame, I hope you don't mind, but I will make the change via my own PR with your fix plus my tests.

And to be clear, ZERO integral bits will not be supported. Even the fixed_0_15 type in @ahooper's original comment is really fixed_1_15.

Cerclique commented 2 years ago

Thanks for the correction ! fixed_1_15 makes total sense to me instead of fixed_0_15.