Reference: zyBooks

1. Floating-point numbers and normalized scientific notation

An integer is a whole number, like 42, 0, or -95. A floating-point number is a real number, like 98.6, 0.0001, or -666.667. The term "floating-point" refers to the decimal point being able to appear anywhere ("float") in the number.

Integers are typically used for values that can be counted, like 42 cars, 0 pizzas, or -95 days.
Floating-point numbers are typically used for values that are measured, like 98.6 degrees, 0.0001 meters, or -666.667 grams.

To improve readability and consistency, floating-point numbers are commonly written using normalized scientific notation, such as 9.86 × 10¹, 1.0 × 10^-4, or -6.66667 × 10², where the number is written as a single nonzero digit (1 to 9, with an optional sign), a decimal point, a fractional part, and a power of 10. "Normalized" contrasts with non-normalized notation, where more than one digit, or a 0, may precede the decimal point, such as -66.6667 × 10¹ or 0.1 × 10^-3.

The parts of scientific notation are named significand for the part before × and exponent for the power of 10: significand × 10^exponent. If the exponent is 0, the power of ten part is sometimes omitted, as in 5.7.

In binary, normalized scientific notation consists of 1.f × 2^exponent, like 1.010 × 2⁵. f is the fractional part.
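As a rough illustration of the 1.f × 2^exponent form, the sketch below uses Python's standard `math.frexp` and shifts its result by one binary place. The function name `normalize_binary` is made up for this note, and zero, infinity, and NaN are simply rejected.

```python
import math
from typing import Tuple

def normalize_binary(x: float) -> Tuple[float, int]:
    """Return (significand, exponent) with 1 <= |significand| < 2
    and x == significand * 2**exponent."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("zero, infinity and NaN have no normalized form")
    m, e = math.frexp(x)   # frexp returns 0.5 <= |m| < 1 with x == m * 2**e
    return m * 2, e - 1    # move the binary point one place: 1 <= |m| < 2

significand, exponent = normalize_binary(40.0)
print(significand, exponent)  # 1.25 5, i.e. 1.010 (binary) x 2^5
```

Here 1.25 in decimal is 1.010 in binary, so 40 matches the 1.010 × 2⁵ example above.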

2. Fractions in binary

Fractions in binary are similar to fractions in decimal. In decimal, each digit after the decimal point has weight 1/10¹, 1/10², 1/10³, etc. In binary, each digit after the binary point has weight 1/2¹, 1/2², 1/2³, 1/2⁴, etc. (so 1/2, 1/4, 1/8, 1/16, etc.). Ex: 1.1101 is 1 + 1/2 + 1/4 + 1/16 = 1.8125. Note that for binary numbers, the "dot" is called a binary point (versus decimal point for decimal numbers). The general term is radix point.
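The digit weights can be checked with a few lines of Python. This is only a sketch: `binary_to_decimal` is a hypothetical helper name, and it handles unsigned numerals only.

```python
def binary_to_decimal(bits: str) -> float:
    """Evaluate an unsigned binary numeral such as '1.1101'."""
    whole, _, frac = bits.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    for position, digit in enumerate(frac, start=1):
        if digit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {digit!r}")
        value += int(digit) / 2 ** position  # weight 1/2, 1/4, 1/8, ...
    return value

print(binary_to_decimal("1.1101"))  # 1.8125 = 1 + 1/2 + 1/4 + 1/16
```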

Fractional digit weights for decimal and binary

If a decimal fraction cannot be exactly represented as a binary fraction using limited bits, one gets as close as possible, over or under. Ex: To represent 0.8 with only four binary fraction bits:

0.1100 is 1/2 + 1/4 = 0.75, which is under by 0.05.
0.1110 is 0.875, which is over by 0.075.
0.1101 is 0.8125, also over but only by 0.0125, which is closest and so is the best representation.

With more fraction bits available, the binary value can get closer to the target.
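The search for the closest value can be sketched by scaling the target by 2^bits and rounding to the nearest integer. The helper name `closest_binary_fraction` is an assumption of this note, and it only covers targets in the range 0 ≤ x < 1.

```python
def closest_binary_fraction(x: float, bits: int) -> tuple:
    """Return (bit pattern, value) of the binary fraction with the given
    number of fraction bits that lies nearest to x, for 0 <= x < 1."""
    if not 0 <= x < 1:
        raise ValueError("x must satisfy 0 <= x < 1")
    scale = 2 ** bits
    # Round to the nearest multiple of 1/scale; clamp so the result stays below 1.
    n = min(round(x * scale), scale - 1)
    return format(n, f"0{bits}b"), n / scale

pattern, value = closest_binary_fraction(0.8, 4)
print(pattern, value)  # 1101 0.8125

# More fraction bits shrink the error.
for bits in (4, 8, 16):
    pattern, value = closest_binary_fraction(0.8, bits)
    print(bits, pattern, abs(value - 0.8))
```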

3. Loss of precision

Using a fixed number of bits means that some numbers cannot be precisely represented in a computer. Ex:

A fraction that requires more bits than the allocated space allows. Ex: 1/8 = 0.125 when only one fraction bit is available.
A rational number with an infinitely-repeating fraction portion. Ex: 2/3 = 0.66666 …
An irrational number. Ex: π = 3.14159 …, √2 = 1.41421 …, and e = 2.71828 …

Increasing the number of bits improves the precision with which such numbers can be represented, but at the cost of more storage.
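A quick way to observe this is to round a value to 32 bits and compare it with the original. The sketch below assumes Python, whose `float` is a 64-bit double, and uses the standard `struct` module; `to_float32` is a made-up helper name.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(to_float32(0.125) == 0.125)  # True: 1/8 needs only three fraction bits
print(to_float32(2 / 3) == 2 / 3)  # False: the repeating fraction is cut off sooner in 32 bits
print(0.1 + 0.2 == 0.3)            # False: even 64 bits cannot hold 0.1, 0.2, or 0.3 exactly
```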

4. Binary floating-point representation

In a computer, a binary floating-point number must be represented using a fixed number of bits. Normalized scientific notation is used, commonly using 32 or 64 bits. A common 32-bit floating-point binary representation has these items:

Sign: 1 bit. 0 means positive, 1 negative.
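Exponent: 8 bits. Instead of two's complement, the exponent is biased, meaning a fixed value, in this case 127, is subtracted from the stored value to get the actual exponent. Thus, a stored exponent of 00000000 would mean 2^{0 - 127} = 2^-127, and 11111111 would mean 2^{255 - 127} = 2¹²⁸. (In IEEE 754, these two patterns are reserved for special values such as zero, infinity, and NaN, so normal numbers use stored exponents 1 through 254.)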
Fraction: 23 bits. Because the significand's first digit is always 1, the 1 implicitly precedes the fraction and thus isn't explicitly represented. The significand in scientific notation is commonly called the mantissa.

A 32-bit floating-point representation:

The above is known as the IEEE single-precision binary floating-point format, which uses 32 bits: 1 bit for sign, 8 bits for exponent (biased by 127), and 23 bits for significand (leading 1 before binary point is implicit). IEEE double-precision binary floating-point format uses 64 bits: 1 bit for sign, 11 bits for exponent, and 52 bits for significand (leading 1 before binary point is implicit).
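To see the three fields of a single-precision value, one option is to reinterpret the 32 bits as an unsigned integer with Python's standard `struct` module. This is a sketch, and `float32_fields` is a hypothetical name chosen for this note.

```python
import struct

def float32_fields(x: float) -> str:
    """Return the single-precision bit pattern of x as 'sign exponent fraction'."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # same 4 bytes, read as an integer
    s = format(bits, "032b")
    return f"{s[0]} {s[1:9]} {s[9:]}"  # 1 + 8 + 23 bits

print(float32_fields(1.0))  # 0 01111111 00000000000000000000000
print(float32_fields(4.5))  # 0 10000001 00100000000000000000000
```

For 1.0 the stored exponent is 127 (01111111), which the bias turns into an actual exponent of 0.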

Note: In C, C++, and Java, a variable like "float x" uses 32 bits (single precision), while "double x" uses 64 bits (double precision). Programmers usually use double to obtain more significand precision, unless memory is tightly constrained.

Binary floating-point to decimal

Example:

Given the following binary floating-point representation, determine the normalized scientific notation and the decimal number.

0 10000001 00100000000000000000000

Sign bit is 0. It is a positive number.
Exponent bits are 10000001. Biased exponent: 10000001₂ = 129. The actual exponent in decimal is 129 - 127 = 2.
Fraction bits are 00100000000000000000000, which in decimal is 2^-3 = 1/8 = 0.125. The significand's first bit is implicitly a 1, so the decimal significand is 1.125.
Normalized scientific notation: 1.125 × 2².
Decimal number: 1.125 × 2² = 1.125 × 4 = 4.5.
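The same decoding steps can be written as a short Python sketch. The name `decode_float32` is made up for this note, and the reserved exponent patterns (all zeros, all ones) are deliberately rejected rather than handled.

```python
def decode_float32(fields: str) -> float:
    """Decode 'S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF' for normal numbers only."""
    bits = fields.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected 32 binary digits")
    sign = -1 if bits[0] == "1" else 1
    biased = int(bits[1:9], 2)
    if biased in (0, 255):
        raise ValueError("reserved exponent patterns are not handled in this sketch")
    exponent = biased - 127                 # remove the bias
    fraction = int(bits[9:], 2) / 2 ** 23   # value of the 23 fraction bits
    return sign * (1 + fraction) * 2.0 ** exponent  # implicit leading 1

print(decode_float32("0 10000001 00100000000000000000000"))  # 4.5
```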

5. Decimal to binary floating-point

Decimal is converted to 32-bit floating-point by first converting the decimal number to binary, then normalizing and adjusting the exponent, and finally filling in the appropriate fields.

Converting decimal with fraction to binary floating-point:

Examples:

Ex1: For 1111.01₂, normalizing yields 1.11101. Thus, the unbiased exponent should be? 1111.01 = 1.11101 × 2³. So the unbiased exponent should be 3.

Ex2: For 10010.01₂, normalizing yields 1.001001. Thus, the biased exponent should be? 10010.01 = 1.001001 × 2⁴, and 4 + 127 = 131. So the biased exponent should be 131.

Ex3: The first three bits of the fraction field of decimal 3 in binary floating-point are: 3₁₀ = 11₂ = 1.1 × 2¹, so the fraction is .1 and, padded with zeros on the right, its first three bits are 100 (not 001, which would mean 2^-3 = 0.125 rather than 2^-1 = 0.5).
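The normalizing step in Ex1 to Ex3 amounts to moving the binary point next to the leading 1 and counting the places moved. A minimal sketch, assuming an unsigned binary numeral that contains at least one 1 (the name `normalize_binary_string` is hypothetical):

```python
def normalize_binary_string(bits: str) -> tuple:
    """'1111.01' -> ('1.11101', 3): normalized significand and unbiased exponent."""
    whole, _, frac = bits.partition(".")
    digits = whole + frac
    first_one = digits.index("1")           # raises ValueError if there is no 1
    exponent = len(whole) - 1 - first_one   # places the binary point moves left
    rest = digits[first_one + 1:].rstrip("0")
    return "1." + (rest or "0"), exponent

for numeral in ("1111.01", "10010.01", "11"):
    significand, exponent = normalize_binary_string(numeral)
    print(numeral, "=", significand, "x 2^", exponent, "biased:", exponent + 127)
```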

Ex4: Convert the decimal number -16.25 to the 32-bit binary floating-point format.

Sign bit: It is 1 because -16.25 is a negative number.
Biased exponent: -16.25₁₀ = -10000.01₂ = -1.000001 × 2⁴. So the unbiased exponent (in base ten) is 4. Then the biased exponent (in base ten) is 4 + 127 = 131.
Exponent bits: 131₁₀ = 10000011₂. So the exponent is 10000011.
Fraction bits: -16.25 is -10000.01 in binary. Normalizing yields -1.000001. Thus, the fraction part is 000001, followed by enough 0's to make 23 bits: 00000100000000000000000.
Significand: The full significand is 1.00000100000000000000000, but the leading 1 is implicit in the format, so only the 23 fraction bits after the binary point are stored.
32-bit binary floating-point format:
```
1 10000011 00000100000000000000000
```
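The steps of Ex4 can be sketched in Python as follows. This is a simplified illustration, not a full IEEE 754 encoder: `encode_float32` is a made-up name, and values that are zero, non-finite, out of the normal exponent range, or that would need rounding to fit 23 fraction bits are rejected.

```python
import math

def encode_float32(x: float) -> str:
    """Return 'sign exponent fraction' for a normal number that fits exactly."""
    if x == 0 or math.isinf(x) or math.isnan(x):
        raise ValueError("special values are not handled in this sketch")
    sign = "1" if x < 0 else "0"
    m, e = math.frexp(abs(x))             # abs(x) == m * 2**e with 0.5 <= m < 1
    significand, exponent = m * 2, e - 1  # normalize to 1.f x 2^exponent
    biased = exponent + 127
    if not 1 <= biased <= 254:
        raise ValueError("exponent outside the normal range")
    scaled = (significand - 1) * 2 ** 23  # the fraction as a 23-bit integer
    if scaled != int(scaled):
        raise ValueError("value needs rounding, which this sketch does not do")
    return f"{sign} {biased:08b} {int(scaled):023b}"

print(encode_float32(-16.25))  # 1 10000011 00000100000000000000000
```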

Qingquan-Li / blog