NCAS-CMS / cfunits

A Python interface to UNIDATA’s UDUNITS-2 library with CF extensions:
http://ncas-cms.github.io/cfunits
MIT License

return in any cases/platforms float32 #35

Open · daniel-mohr opened this issue 3 years ago

daniel-mohr commented 3 years ago

Do we really want to return float32 in all cases and on all platforms? (Changing this could break existing usage of cfunits.)

I believe that every 4 byte integer can be stored in an 8 byte float (double) without losing precision, so it would be more accurate to use numpy.float64 in all cases. Furthermore, for an integer with more than 4 bytes a float32 is not sufficient anyway. What do you think?
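
A quick check with numpy (just a sketch, not cfunits code) makes this concrete: $2^{24}+1$ is the smallest positive integer that a float32 cannot represent, while a float64 round-trips every int32 exactly.

```python
import numpy as np

i = np.int32(2**24 + 1)    # smallest positive integer a float32 cannot hold
print(np.float32(i) == i)  # False: the value is rounded to 2**24
print(np.float64(i) == i)  # True: every int32 fits exactly in a float64
```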

daniel-mohr commented 3 years ago

Recently I was talking with a friend about floating-point numbers, and suddenly this issue became clear to me.

If we assume IEEE 754, a 4 byte float (float32) consists of 1 sign bit $s$, 8 exponent bits $e_7 \dots e_0$ and 23 mantissa bits $m_{22} \dots m_0$.

From these bits we get the represented decimal number with a few mathematical formulas:

sign: $S = (-1)^s$
mantissa: $M = 1 + \frac{\sum_{i=0}^{22} m_i \, 2^i}{2^{23}}$
exponent: $E = \sum_{i=0}^{7} e_i \, 2^i - (2^7 - 1) = \sum_{i=0}^{7} e_i \, 2^i - 127$
decimal number: $S \cdot M \cdot 2^E$
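
For illustration, here is a small Python sketch of my own (normal numbers only, i.e. ignoring denormals, infinities and NaNs) that decodes a float32 with exactly these formulas:

```python
import struct

def decode_float32(x):
    # Reinterpret the float32 as its 32 raw bits
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    s = bits >> 31             # 1 sign bit
    e = (bits >> 23) & 0xFF    # 8 exponent bits
    m = bits & 0x7FFFFF        # 23 mantissa bits
    S = (-1) ** s              # sign
    M = 1 + m / 2**23          # mantissa with the implicit leading 1
    E = e - 127                # exponent minus the bias 2**7 - 1
    return S * M * 2**E

print(decode_float32(0.15625))  # 0.15625
```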

And similarly for a double (float64), with 1 sign bit $s$, 11 exponent bits $e_{10} \dots e_0$ and 52 mantissa bits $m_{51} \dots m_0$:

sign: $S = (-1)^s$
mantissa: $M = 1 + \frac{\sum_{i=0}^{51} m_i \, 2^i}{2^{52}}$
exponent: $E = \sum_{i=0}^{10} e_i \, 2^i - (2^{10} - 1) = \sum_{i=0}^{10} e_i \, 2^i - 1023$
decimal number: $S \cdot M \cdot 2^E$
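
The same sketch adapted to float64 (again normal numbers only):

```python
import struct

def decode_float64(x):
    # Reinterpret the float64 as its 64 raw bits
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    s = bits >> 63              # 1 sign bit
    e = (bits >> 52) & 0x7FF    # 11 exponent bits
    m = bits & ((1 << 52) - 1)  # 52 mantissa bits
    return (-1) ** s * (1 + m / 2**52) * 2 ** (e - 1023)

print(decode_float64(-3.5))  # -3.5
```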

So, this means a float32 keeps only 23 explicit mantissa bits of an integer (24 significant bits, counting the implicit leading 1). Converting an int32 to a float32 therefore leads to a relative error of up to $\left| \frac{(2^{24}+1) - \mathrm{float32}(2^{24}+1)}{2^{24}+1} \right| \approx 6 \times 10^{-8}$.

Whereas using a float64 we can store any int32 with full precision.

But converting an int64 to a float64 again leads to a relative error of up to $\left| \frac{(2^{53}+1) - \mathrm{float64}(2^{53}+1)}{2^{53}+1} \right| \approx 1 \times 10^{-16}$.

So in both cases (int32 → float32, int64 → float64) the relative error is on the order of the machine epsilon.
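
This can be checked numerically; here is a sketch using numpy's finfo for the machine epsilon (the measured error is half an epsilon, since the conversion rounds to the nearest representable float):

```python
import numpy as np

for n, ftype in ((2**24 + 1, np.float32), (2**53 + 1, np.float64)):
    rel_err = abs(n - int(ftype(n))) / n  # relative error of the conversion
    print(f"{ftype.__name__}: error {rel_err:.1e}, eps {np.finfo(ftype).eps:.1e}")
```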

There are some tools where you can play around with floating-point representations (and there are surely more and better ones):