C2 C1 and C0 float to fix point translation

I would like to ask how to get C2, C1, and C0 into the right format to reuse the circuit for Tanh. Using a minimax approximation, I have obtained C2, C1, and C0 in floating-point format, but plugging them into the circuit gives an incorrect result.

I have checked that C2, C1, and C0 are correct in floating-point format, so the error probably comes from the conversion. C2, C1, and C0 are converted to a fixed-point format with one sign bit, one integer bit, and then 27, 18, and 12 fractional bits for C0, C1, and C2 respectively. For instance, for C0 = 0.00097656250000000000000000000000000, the fixed-point value is 29'b00000000000011111111111111111; for C1 = 0.99999904632568359375000000000000, it is 20'b00111111111111111111; and for C2 = -0.00097656250000000000000000000000000, it is 14'b10000000000111.

Could you please share more information about the float-to-fixed-point conversion? Thanks a lot in advance.

Thanks for being interested in extending the capabilities of the SFU unit that we implemented. Clearly, your concerns are not related to an actual issue of our current hardware implementation but more related to a possible extension you plan to do, so I'll proceed to answer you and close the open issue.

Regarding your concerns, giving you accurate feedback is a bit hard, but as the SFU we implemented is based on a polynomial approximation, we expect to have some errors in the output. However, I'm assuming you took care of the following three main aspects, which can help you to debug and maybe find the root of your issue:

Range reduction, if your function (Tanh) needs it (it might require more than one set of coefficients, depending on the exponent range).
Selection of an adequate number of intervals (m) for the coefficients (C0, C1, and C2) using the enhanced minimax approximation.
Normalization and exponent adjustment after the polynomial approximation, which are required to convert the fixed-point result to IEEE 754 floating point.
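
The third point, normalization and exponent adjustment, can be illustrated with a small software model. This is only a sketch in Python under stated assumptions, not the SFU datapath: the function name `fixed_to_float32_bits` and the helper `bits_to_float` are hypothetical, the result is truncated rather than rounded, and only values in the normal float32 range are handled.

```python
import struct

def fixed_to_float32_bits(raw, frac_bits):
    """Convert a signed fixed-point integer (value * 2**frac_bits) into an
    IEEE 754 single-precision bit pattern. Hypothetical helper: truncates
    extra bits and assumes the result is a normal (not subnormal) number."""
    if raw == 0:
        return 0
    sign = 1 if raw < 0 else 0
    mag = abs(raw)
    msb = mag.bit_length() - 1           # position of the leading one
    exponent = msb - frac_bits + 127     # exponent adjustment (biased)
    # Normalization: move the leading one to bit 23 (the hidden bit).
    if msb > 23:
        mant = mag >> (msb - 23)
    else:
        mant = mag << (23 - msb)
    fraction = mant & 0x7FFFFF           # drop the hidden bit
    return (sign << 31) | (exponent << 23) | fraction

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as a Python float (for checking only)."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# 3.0 stored with 10 fractional bits
print(hex(fixed_to_float32_bits(3 << 10, 10)))
print(bits_to_float(fixed_to_float32_bits(-(1 << 9), 10)))
```

A hardware version would typically use a leading-zero counter and a barrel shifter for the same two steps.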

Please refer to this paper, which we follow for our SFU implementations (https://ieeexplore.ieee.org/abstract/document/1388195).

Our SFU core repository contains the complete design of the SFU (the same one used in the FlexGripPlus repository), including some Matlab scripts with the original coefficients in floating point and an automatic generator for the fixed-point LUTs. It may also give you a better understanding of how the SFU works and how to extend it.

You can use the Quadratic_Interpolator engine we developed, but additional hardware adjustments might be necessary to support the Tanh function (hardware support for range reduction, plus result normalization and exponent adjustment).

If you are getting extremely high interpolation errors from the quadratic interpolation unit, you probably need more bits for the coefficients and, in turn, extend the computational unit capabilities. In such a case, the complete hardware should be extended. It is a doable solution but, at the same time, a challenging task since our version was optimized to reduce hardware resources, so it is not generic hardware. It would be great if you could contribute to making it generic and improving it for hardware exploration design :)

Finally, your conversion example from C0, C1, and C2 to a fixed-point representation is close to my calculations, so I would say it is correct. In any case, I leave my calculations here as a reference.

C0 = 0.00097656250000000000000000000000000 x 2^27 => 29'b00000000000100000000000000000
C1 = 0.99999904632568359375000000000000 x 2^18 => 20'b01000000000000000000
C2 = -0.00097656250000000000000000000000000 x 2^13 => 14'b10000000001000

Note: C2 uses 13 fractional bits, no integer bit, and one sign bit (S.FFFFFFFFFFFFF); this was adopted because the integer part of C2 was always 0.

I hope this answers your questions.
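
As a cross-check, the three reference values above can be reproduced with a few lines of Python. This is a minimal sketch that assumes a sign-magnitude encoding (which is what the C2 bit string suggests) and round-to-nearest; the function name `to_sign_magnitude` is hypothetical and is not taken from the repository's Matlab scripts.

```python
def to_sign_magnitude(value, total_bits, frac_bits):
    """Encode value as one sign bit followed by a (total_bits - 1)-bit
    magnitude scaled by 2**frac_bits. Assumption: sign-magnitude format
    with round-to-nearest, as inferred from the reference values."""
    magnitude = int(abs(value) * (1 << frac_bits) + 0.5)  # round to nearest
    if magnitude >= 1 << (total_bits - 1):
        raise OverflowError("value does not fit in the requested format")
    sign = 1 if value < 0 else 0
    word = (sign << (total_bits - 1)) | magnitude
    return f"{total_bits}'b{word:0{total_bits}b}"

print(to_sign_magnitude(0.0009765625, 29, 27))           # C0
print(to_sign_magnitude(0.99999904632568359375, 20, 18))  # C1
print(to_sign_magnitude(-0.0009765625, 14, 13))           # C2
```

With rounding, C1 becomes exactly 1.0 in this format, whereas truncation would give 20'b00111111111111111111, the value from the original question.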

Software used to generate the coefficients: Maplesoft's Maple. In the SFU, each coefficient LUT (c0, c1, c2) is a coefficient array indexed with m = 6 bits, so each array has 64 (2^6) elements. The word sizes (bit widths) of the stored c0, c1, and c2 values are t, p, and q, respectively.

Quadratic polynomial coefficient values c0, c1, c2 for sin(x), cos(x), rsqrt(x), log2(x), exp2(x), 1/x, and sqrt(x):

                                c0              c1              c2
        1/x                    +0.1xxxx...xx,  -0.xxxxx...xx,  +0.xxxxx...xx
        sqrt(x)                +1.0xxxx...xx,  +0.01xxx...xx,  -0.000xx...xx
        rsqrt(x)               +0.1xxxx...xx,  -0.0xxxx...xx,  +0.0xxxx...xx
        exp2(x)                +1.xxxxx...xx,  +x.xxxxx...xx,  +0.0xxxx...xx
        log2(x)                +0.xxxxx...xx,  +x.xxxxx...xx,  -0.xxxxx...xx
        sin(x), cos(x)         +0.xxxxx...xx,  +x.xxxxx...xx,  -0.0xxxx...xx

Is that right?

A second-order polynomial approximation of a transcendental function has the form
f(x) = C0(XH) + C1(XH)XL + C2(XH)XL^2
The fractional field of the 32-bit floating-point input is n bits wide, and the input argument x of f is divided into an upper part XH of m bits and a lower part XL of (n-m) bits. The coefficients C0, C1, and C2 are generated per value of XH, and XH is used as the select index into the C0, C1, and C2 LUTs when the transcendental function is called.
Is the format conversion done with a custom fixed-point format for each transcendental function, via the 'adjust processing scope for some operations' step?
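
The XH/XL split and the LUT lookup described above can be sketched in Python as follows. Several things here are assumptions made only for illustration: n is taken to be the 23-bit float32 fraction field, the input is restricted to [1, 2), the example function is 1/x, and the coefficients are plain Taylor coefficients at the start of each interval rather than the enhanced minimax coefficients used by the SFU. All names are hypothetical.

```python
import struct

M = 6    # XH width in bits (m = 6 gives a 64-entry LUT, as in this thread)
N = 23   # assumed n: width of the float32 fraction field

def split_fraction(x):
    """Return (XH, XL) taken from the fraction field of float32 x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    frac = bits & ((1 << N) - 1)
    xh = frac >> (N - M)                # upper m bits: LUT index
    xl = frac & ((1 << (N - M)) - 1)    # lower n-m bits: offset in interval
    return xh, xl

def build_reciprocal_lut():
    """Stand-in coefficients: Taylor expansion of 1/x at each interval start."""
    lut = []
    for i in range(1 << M):
        x0 = 1.0 + i / (1 << M)
        lut.append((1.0 / x0, -1.0 / x0**2, 1.0 / x0**3))
    return lut

LUT = build_reciprocal_lut()

def reciprocal_approx(x):
    """Evaluate C0(XH) + C1(XH)*XL + C2(XH)*XL^2 for x in [1, 2)."""
    xh, xl = split_fraction(x)
    c0, c1, c2 = LUT[xh]
    d = xl / (1 << N)                   # XL as a real-valued offset
    return c0 + c1 * d + c2 * d * d

print(split_fraction(1.5))
print(reciprocal_approx(1.3), 1 / 1.3)
```

Minimax coefficients would spread the error over each interval instead of concentrating it at the far end, which is why they need fewer LUT entries for the same accuracy.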

The ranges of the coefficients C0, C1, and C2 are different for each of the six function operations. The coefficients are calculated with, for example, Maple, and the three coefficients are analyzed based on their maximum and minimum values across the six function operations. To use the same ROM size for every function:
29 bits for the C0 coefficients (3.26 two's complement format: 1 sign bit, 2 integer bits, 26 fractional bits)
20 bits for the C1 coefficients (4.16 format: 1 sign bit, 3 integer bits, 16 fractional bits)
14 bits for the C2 coefficients (6.8 format: 1 sign bit, 5 integer bits, 8 fractional bits)
The six operations are sin(x), cos(x), rsqrt(x), log2(x), exp2(x), 1/x, and sqrt(x).

Please explain the number of sign bits, integer bits, and fractional bits of c0, c1, and c2 for each transcendental function used in FlexGrip.

For example, please explain how the bit string stored at a specific index of the cos LUT arrays c0, c1, and c2 changes when it is converted to a 32-bit float value through the corresponding format above.
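
For a layout like the 3.26, 4.16, and 6.8 formats quoted above, decoding a LUT word back into a real value could look like the following Python sketch. It assumes a two's complement encoding, as that paragraph states; the maintainer's reference example earlier in the thread appears to use sign-magnitude with different fractional widths, so the actual FlexGrip LUTs may need a different decoder. The function name and the sample bit strings are made up for illustration and are not real LUT entries.

```python
def decode_twos_complement(bits, frac_bits):
    """Interpret a binary string as a two's complement fixed-point number
    with frac_bits fractional bits (assumed format, hypothetical helper)."""
    raw = int(bits, 2)
    if bits[0] == "1":               # negative: subtract 2**width
        raw -= 1 << len(bits)
    return raw / (1 << frac_bits)

# Made-up sample words, not actual LUT contents:
print(decode_twos_complement("001" + "0" * 26, 26))    # 29-bit word, 3.26
print(decode_twos_complement("11111111111000", 8))     # 14-bit word, 6.8
```

The resulting Python float can then be compared against the floating-point coefficient the word was generated from.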

How can I check the output value after setting the transcendental function type and the input value? See page 6 of SFU_User_Manual V2.pdf: https://zenodo.org/record/3934441#.Ye7dNPgo8uU

How should the ModelSim .tcl script be edited to check, for example, a cos(x) input?

Jerc007 / Open-GPGPU-FlexGrip-

C2 C1 and C0 float to fix point translation #4