Qingquan-Li / blog

My Blog
https://Qingquan-Li.github.io/blog/

Binary floating-point numbers #233

Open Qingquan-Li opened 1 year ago


Reference: zyBooks

1. Floating-point numbers and normalized scientific notation

An integer is a whole number, like 42, 0, or -95. A floating-point number is a real number, like 98.6, 0.0001, or -666.667. The term "floating-point" refers to the decimal point being able to appear anywhere ("float") in the number.

To improve readability and consistency, floating-point numbers are commonly written in normalized scientific notation, such as $9.86 \times 10^{1}$, $1.0 \times 10^{-4}$, or $-6.66667 \times 10^{2}$, where the number is written as a single nonzero digit (1 to 9, with an optional sign), a decimal point, a fractional part, and a power of 10. "Normalized" contrasts with non-normalized forms, in which more than one digit, or a 0, precedes the decimal point, such as $-66.6667 \times 10^{1}$ or $0.1 \times 10^{-3}$.

The parts of scientific notation are named the significand, for the part before ×, and the exponent, for the power of 10: $\text{significand} \times 10^{\text{exponent}}$. If the exponent is 0, the power-of-ten part is sometimes omitted, as in 5.7.

In binary, normalized scientific notation has the form $1.f \times 2^{\text{exponent}}$, such as $1.010 \times 2^{5}$, where $f$ is the fractional part.
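As a rough illustration of the binary form above, the sketch below splits a number into a significand between 1 and 2 and a power of two. The helper name `normalized_binary` is made up for this example; it relies on Python's standard `math.frexp`, which returns a significand between 0.5 and 1, so the result is rescaled by one bit.

```python
import math

def normalized_binary(x):
    """Return (significand, exponent) with 1 <= |significand| < 2
    and x == significand * 2**exponent."""
    if x == 0:
        raise ValueError("zero has no normalized form")
    m, e = math.frexp(x)      # x == m * 2**e with 0.5 <= |m| < 1
    return m * 2, e - 1       # shift one bit so the leading digit is 1

sig, exp = normalized_binary(40.0)
print(sig, exp)  # 1.25 5
```

Here 1.25 is the decimal value of binary 1.010, so 40 is $1.010 \times 2^{5}$.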

2. Fractions in binary

Fractions in binary are similar to fractions in decimal. In decimal, each digit after the decimal point has weight $1/10^{1}$, $1/10^{2}$, $1/10^{3}$, etc. In binary, each digit after the binary point has weight $1/2^{1}$, $1/2^{2}$, $1/2^{3}$, $1/2^{4}$, etc. (so 1/2, 1/4, 1/8, 1/16, etc.). Ex: $1.1101_2$ is 1 + 1/2 + 1/4 + 1/16 = 1.8125. Note that for binary numbers, the "dot" is called a binary point (versus decimal point for decimal numbers). The general term is radix point.
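The digit weights can be checked with a few lines of Python. This is only a sketch: `binary_to_decimal` is a hypothetical helper that accepts an unsigned binary string and adds up the weight of each 1 bit after the binary point.

```python
def binary_to_decimal(bits):
    """Convert an unsigned binary string such as '1.1101' to a float."""
    whole, _, frac = bits.partition(".")
    value = int(whole, 2) if whole else 0
    for position, bit in enumerate(frac, start=1):
        if bit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {bit!r}")
        value += int(bit) / 2 ** position   # weight of this digit is 1/2**position
    return value

print(binary_to_decimal("1.1101"))  # 1.8125
```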

Figure: Fractional digit weights for decimal and binary

If a decimal fraction cannot be exactly represented as a binary fraction using limited bits, one gets as close as possible, over or under. Ex: With only four binary fraction bits, the closest value to 0.8 is 0.1101 (0.8125), slightly over; the nearest value under is 0.1100 (0.75).

With more fraction bits, the binary value can get closer to the intended decimal value.
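To see how the approximation of 0.8 tightens as bits are added, here is a minimal sketch. The function `nearest_fraction` is an illustrative name, and it assumes a value between 0 and 1 that does not round up to 1.0.

```python
def nearest_fraction(x, bits):
    """Return (value, bit_pattern) for the closest number to x that uses
    exactly `bits` binary fraction bits. Assumes 0 <= x < 1."""
    scale = 2 ** bits
    n = round(x * scale)                 # nearest multiple of 1/2**bits
    return n / scale, format(n, f"0{bits}b")

for bits in (4, 8, 16):
    value, pattern = nearest_fraction(0.8, bits)
    print(bits, "0." + pattern, value, abs(value - 0.8))
```

With 4 bits the result is 0.1101 (0.8125), and with 8 bits it is 0.11001101 (0.80078125). The printed error shrinks with each step.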

3. Loss of precision

Using a fixed number of bits means that some numbers cannot be precisely represented in a computer. Ex:

Increasing the number of bits improves the precision with which such numbers can be represented, but at the cost of more storage.
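The effect is easy to observe in practice. The sketch below is not from the reference text: it uses Python's standard `struct` module to round a value to 32 bits and back, with `to_float32` as a made-up helper name. Python's own `float` is a 64-bit type.

```python
import struct

def to_float32(x):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

print(to_float32(0.1))    # 0.10000000149011612
print(to_float32(0.5))    # 0.5, exactly representable
print(0.1 + 0.2 == 0.3)   # False, even with 64 bits
```

Decimal 0.1 has no finite binary expansion, so both sizes store an approximation; the 64-bit one is just closer.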

4. Binary floating-point representation

In a computer, a binary floating-point number must be represented using a fixed number of bits. Normalized scientific notation is used, commonly using 32 or 64 bits. A common 32-bit floating-point binary representation has these items:

Figure: A 32-bit floating-point representation

The above is known as the IEEE single-precision binary floating-point format, which uses 32 bits: 1 bit for the sign, 8 bits for the exponent (biased by 127), and 23 bits for the fractional part of the significand (the leading 1 before the binary point is implicit). The IEEE double-precision binary floating-point format uses 64 bits: 1 bit for the sign, 11 bits for the exponent (biased by 1023), and 52 bits for the fractional part of the significand (the leading 1 is again implicit).
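The three single-precision fields can be pulled out of a real value with shifts and masks. The following is a sketch using Python's standard `struct` module; `float32_fields` is a name chosen for this example.

```python
import struct

def float32_fields(x):
    """Return (sign, biased_exponent, fraction) of x as stored in
    IEEE single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))  # reinterpret as a 32-bit integer
    sign = bits >> 31                      # top bit
    biased_exponent = (bits >> 23) & 0xFF  # next 8 bits
    fraction = bits & 0x7FFFFF             # low 23 bits
    return sign, biased_exponent, fraction

sign, exp, frac = float32_fields(-0.75)
print(sign, format(exp, "08b"), format(frac, "023b"))
# 1 01111110 10000000000000000000000
```

Since -0.75 is $-1.1 \times 2^{-1}$ in binary, the biased exponent is -1 + 127 = 126 and the fraction field starts with a single 1.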

Note: In C, C++, and Java, a variable like "float x" uses 32 bits (single precision), while "double x" uses 64 bits (double precision). Programmers usually use double to obtain more significand precision, unless memory is tightly constrained.

Binary floating-point to decimal

Example:

Given the following binary floating-point representation, determine the normalized scientific notation and the decimal number.

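0 10000001 00100000000000000000000

Solution: The sign bit is 0, so the number is positive. The exponent field is $10000001_2 = 129$, so the unbiased exponent is 129 - 127 = 2. The fraction field is 001 followed by zeros, so the significand is 1.001. The normalized scientific notation is $1.001 \times 2^{2}$, which is $100.1_2 = 4.5$ in decimal.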
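The same decoding can be sketched in Python by applying the format's definition field by field. The function `decode_float32` is a hypothetical helper; it handles only normalized values and ignores the special cases where the exponent field is all 0s or all 1s.

```python
def decode_float32(sign_bit, exponent_bits, fraction_bits):
    """Decode a normalized single-precision value from its three
    bit fields, each given as a string of 0s and 1s."""
    sign = -1 if sign_bit == "1" else 1
    exponent = int(exponent_bits, 2) - 127   # remove the bias of 127
    # The stored fraction follows an implicit leading 1.
    significand = 1 + int(fraction_bits, 2) / 2 ** len(fraction_bits)
    return sign * significand * 2 ** exponent

print(decode_float32("0", "10000001", "00100000000000000000000"))  # 4.5
```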

5. Decimal to binary floating-point

Decimal is converted to 32-bit floating-point by first converting the decimal number to binary, then normalizing and adjusting the exponent, and finally filling in the appropriate fields.

Figure: Converting a decimal number with a fraction to binary floating-point
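The first two steps, converting to binary and then normalizing, can be sketched as follows. Both function names are made up for this example, and the sketch assumes a non-negative input of at least 1 so that the binary string begins with a 1.

```python
def to_binary(value, max_fraction_bits=23):
    """Convert a non-negative decimal number to a binary string
    such as '10010.01'."""
    whole = int(value)
    frac = value - whole
    digits = []
    while frac and len(digits) < max_fraction_bits:
        frac *= 2                 # the next bit is the integer part of frac * 2
        bit = int(frac)
        digits.append(str(bit))
        frac -= bit
    return format(whole, "b") + ("." + "".join(digits) if digits else "")

def normalize(binary):
    """Return (fraction_bits, unbiased_exponent) for a binary string
    whose value is at least 1."""
    whole, _, frac = binary.partition(".")
    exponent = len(whole) - 1     # places the binary point moves left
    return whole[1:] + frac, exponent

b = to_binary(18.25)
frac, exp = normalize(b)
print(b, "= 1." + frac, "x 2^" + str(exp), "biased:", exp + 127)
# 10010.01 = 1.001001 x 2^4 biased: 131
```

The last step, filling in the fields, amounts to writing the sign bit, the biased exponent in 8 bits, and the fraction bits padded with zeros on the right to 23 bits.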

Examples:

Ex1: For $1111.01_2$, normalizing yields 1.11101. What is the unbiased exponent? $1111.01 = 1.11101 \times 2^{3}$, so the unbiased exponent is 3.

Ex2: For $10010.01_2$, normalizing yields 1.001001. What is the biased exponent? $10010.01 = 1.001001 \times 2^{4}$, and 4 + 127 = 131, so the biased exponent is 131.

Ex3: What are the first three bits of the fractional part of the significand of decimal 3 in binary floating-point? $3_{10} = 11_2 = 1.1 \times 2^{1}$, so the first three fraction bits are 100 (not 001: the fraction is padded with zeros on the right, and 001 would instead mean $2^{-3} = 0.125$).

Ex4: Convert the decimal number -16.25 to the 32-bit binary floating-point format.
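One way to work Ex4 by hand: $16.25_{10} = 10000.01_2 = 1.000001 \times 2^{4}$, so the sign bit is 1, the biased exponent is 4 + 127 = 131, and the fraction bits are 000001 padded with zeros. The sketch below assembles those fields and cross-checks them against the encoding produced by Python's standard `struct` module. `encode_float32` is an illustrative helper that expects the already-normalized fraction bits and exponent.

```python
import struct

def encode_float32(sign, fraction_bits, unbiased_exponent):
    """Assemble the three single-precision fields as bit strings from a
    normalized value 1.fraction_bits x 2**unbiased_exponent."""
    sign_bit = "1" if sign < 0 else "0"
    exponent_bits = format(unbiased_exponent + 127, "08b")  # add the bias
    fraction_field = fraction_bits.ljust(23, "0")           # pad on the right
    return sign_bit, exponent_bits, fraction_field

fields = encode_float32(-1, "000001", 4)
print(" ".join(fields))  # 1 10000011 00000100000000000000000

# Cross-check against the machine's own single-precision encoding.
(raw,) = struct.unpack(">I", struct.pack(">f", -16.25))
assert format(raw, "032b") == "".join(fields)
```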