NCAS-CMS / cfunits

A Python interface to UNIDATA’s UDUNITS-2 library with CF extensions:
http://ncas-cms.github.io/cfunits
MIT License

return in any cases/platforms float32 #35

Open daniel-mohr opened 3 years ago

daniel-mohr commented 3 years ago

Do we really want to return float32 in all cases/platforms? (Changing this could break existing usage of cfunits.)

I believe that every 4 byte integer can be stored in an 8 byte float (double) without losing precision. Therefore it would be more accurate to use numpy.float64 in all cases. Furthermore, for an integer with more than 4 bytes, a float32 is certainly not sufficient. What do you think?
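
A quick NumPy check of this claim (a sketch, independent of cfunits itself; 2^24 + 1 is the smallest positive integer that float32 cannot represent, see the derivation below):

```python
import numpy as np

i = 2**24 + 1              # 16777217, fits easily in an int32
print(int(np.float32(i)))  # 16777216 -- float32 drops the last bit
print(int(np.float64(i)))  # 16777217 -- float64 stores every int32 exactly
```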

daniel-mohr commented 3 years ago

Recently I was talking with a friend about floating-point numbers and suddenly this issue became clear to me.

If we assume IEEE 754, a 4 byte float (float32) consists of 1 sign bit s, 8 exponent bits e_7 ... e_0 and 23 mantissa bits m_22 ... m_0.

And with a few mathematical formulas we get the decimal number from these bits:

sign: S = (-1)^s
mantissa: M = 1 + \left( \sum_{i=0}^{22} m_i \cdot 2^i \right) / 2^{23}
exponent: E = \sum_{i=0}^{7} e_i \cdot 2^i - (2^7 - 1) = \sum_{i=0}^{7} e_i \cdot 2^i - 127
decimal number: S \cdot M \cdot 2^E

And similarly for a double (float64), with 1 sign bit, 11 exponent bits and 52 mantissa bits:

sign: S = (-1)^s
mantissa: M = 1 + \left( \sum_{i=0}^{51} m_i \cdot 2^i \right) / 2^{52}
exponent: E = \sum_{i=0}^{10} e_i \cdot 2^i - (2^{10} - 1) = \sum_{i=0}^{10} e_i \cdot 2^i - 1023
decimal number: S \cdot M \cdot 2^E
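
These formulas can be checked directly. Below is a minimal sketch (the decode helper is hypothetical, and it handles normal numbers only; zeros, subnormals, infinities and NaNs are ignored) that extracts the raw bits of a float and applies exactly the formulas above:

```python
import struct

def decode(x, fmt='>f', n_exp=8, n_man=23):
    """Decode a normal IEEE 754 float from its raw bits.

    fmt='>f', n_exp=8,  n_man=23 for float32;
    fmt='>d', n_exp=11, n_man=52 for float64.
    """
    size = struct.calcsize(fmt)
    bits = int.from_bytes(struct.pack(fmt, x), 'big')
    s = (bits >> (8 * size - 1)) & 1                       # sign bit s
    e = [(bits >> (n_man + i)) & 1 for i in range(n_exp)]  # exponent bits e_i
    m = [(bits >> i) & 1 for i in range(n_man)]            # mantissa bits m_i
    S = (-1) ** s
    M = 1 + sum(m[i] * 2**i for i in range(n_man)) / 2**n_man
    E = sum(e[i] * 2**i for i in range(n_exp)) - (2**(n_exp - 1) - 1)
    return S * M * 2**E

print(decode(0.15625))                # 0.15625 via the float32 formulas
print(decode(0.15625, '>d', 11, 52))  # 0.15625 via the float64 formulas
```

The float32 path works because struct.pack('>f', x) rounds x to single precision before the bits are read back.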

So, this means for a float32 we can only use 23 bits of the original mantissa of an integer (24 significant bits including the implicit leading 1), so 2^{24} + 1 is the smallest positive integer that cannot be represented exactly. Converting an int32 to float32 therefore leads to a relative error of up to \left| \frac{2^{24} + 1 - \mathrm{float32}(2^{24} + 1)}{2^{24} + 1} \right| \approx 6 \cdot 10^{-8}.

Using a float64, on the other hand, we can store an int32 with full precision.

But converting an int64 to float64 again leads to a relative error of up to \left| \frac{2^{53} + 1 - \mathrm{float64}(2^{53} + 1)}{2^{53} + 1} \right| \approx 1 \cdot 10^{-16}.

So in both cases (int32 -> float32, int64 -> float64) we have a relative error on the order of the machine epsilon.
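
A quick check of both numbers against NumPy's machine epsilons (a sketch; Fraction keeps the error computation itself exact):

```python
from fractions import Fraction
import numpy as np

for i, ftype in ((2**24 + 1, np.float32), (2**53 + 1, np.float64)):
    err = Fraction(i - int(ftype(i)), i)   # exact relative conversion error
    print(f'{float(err):.2e} vs eps = {np.finfo(ftype).eps:.2e}')
# prints 5.96e-08 vs eps = 1.19e-07 and 1.11e-16 vs eps = 2.22e-16,
# i.e. each error is about half the machine epsilon of the target type
```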

There are some tools where you can play around with these representations (and there are probably more and better ones):