LLNL / fpzip

Lossless compressor of multidimensional floating-point arrays
http://fpzip.llnl.gov
BSD 3-Clause "New" or "Revised" License

Understanding Compression Ratios #3

Closed: xanderdunn closed this issue 3 years ago

xanderdunn commented 3 years ago

Many thanks for providing these excellent compressors as open source software! I do not believe this is a bug report, but rather an inquiry into the characteristics of the compression algorithm and how to interpret the compression ratio.

I have made a very simple experiment to better understand the characteristics of fpzip's compression algorithm:

// Helper function for losslessly compressing a 1D array of doubles and returning the compression ratio
func compress(data: [Double], toFPZ fpz: UnsafeMutablePointer<FPZ>?) -> (Double, Int) {
    let nx: Int32 = Int32(data.count)
    let size = Int32(MemoryLayout<Double>.size)

    fpz!.pointee.type = FPZIP_TYPE_DOUBLE
    fpz!.pointee.prec = 64 // Int32(CHAR_BIT * size), this is full 64 bits of precision, which is lossless
    fpz!.pointee.nx = nx
    fpz!.pointee.ny = 1
    fpz!.pointee.nz = 1
    fpz!.pointee.nf = 1

    // write header
    fpzip_write_header(fpz)

    // perform actual compression
    let outbytes = fpzip_write(fpz, data)

    fpzip_write_close(fpz)

    let ratio = Double(nx) * Double(size) / Double(outbytes)
    return (ratio, outbytes)
}

var rng = ThreefryRandomNumberGenerator(seed: [42]) // This is provided by Swift for TensorFlow
let randomDistribution = UniformFloatingPointDistribution(lowerBound: 0.0, upperBound: Double(count)) // Also from Swift for TensorFlow
let tensorflowRandomValues: [Double] = (0..<Int(count * 10)).map { _ in randomDistribution.next(using: &rng) }
let tensorflowCompressionRatio = compressionRatio(data: tensorflowRandomValues)
print("uniform compression ratio:", tensorflowCompressionRatio)

// I expect the compression ratio to be better for a non-uniform distribution
let normalDistributionGenerator = NormalDistribution(mean: 0.0, standardDeviation: 1.0)
let normalValues: [Double] = (0..<count).map { _ in normalDistributionGenerator.next(using: &rng) }
let normalCompressionRatio = compressionRatio(data: normalValues)
print("normal compression ratio:", normalCompressionRatio)

Please excuse the fact that the code is written in Swift. I wrote a Swift->C interface to call fpzip from Swift, but the same should be readily achievable directly in C. This code generates random data from a uniform distribution and from a normal distribution, compresses each independently as 1-dimensional arrays, and then reports the compression ratios. It produces this output:

uniform compression ratio: 1.1626494222504358
normal compression ratio: 1.0636699538500198

Two questions arise: why does random data compress at all, and why does the uniform data compress better than the lower-entropy normal data? To check the entropies of the two distributions:

from scipy import stats

print(stats.uniform(loc=0.0, scale=100000).entropy())  # uniform distribution from 0 to 100,000
print(stats.norm(loc=0.0).entropy())  # normal distribution with standard deviation 1.0

Producing output:

11.512925464970229
1.4189385332046727


As expected, the uniform distribution has higher entropy than the normal distribution. Similar results are obtained when calculating discrete entropies on discretized samples from the uniform and normal distributions.
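That discretized comparison can be sketched in NumPy (a check of my own, not part of the original experiment; the 1,000-bin width is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)

def discrete_entropy(samples, bins=1000):
    """Shannon entropy (bits) of samples discretized into equal-width bins."""
    counts, _ = np.histogram(samples, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

uniform_h = discrete_entropy(rng.uniform(0.0, 100_000.0, size=1_000_000))
normal_h = discrete_entropy(rng.normal(0.0, 1.0, size=1_000_000))

# With equal-width bins spanning each sample's own range, the uniform
# distribution comes out with higher discrete entropy than the normal.
print(uniform_h, normal_h)
```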

I realize attempting to compress random data is a fool's errand; this is merely an experiment to aid my understanding of how the compression works.
xanderdunn commented 3 years ago

I realized that it may be throwing out the sign bit for all-positive values. Indeed, when I adjust the uniform distribution so that it's centered at 0 from [-50000, 50000] instead of [0, 100000], the compression ratio is approximately equal to that of the normal distribution:

uniform centered at 0 compression ratio: 1.0571481051280933
normal compression ratio: 1.0633603249629153

The only remaining question is, to what do we attribute the remaining ~6% gain in compression on the uniformly random data?

lindstro commented 3 years ago

There are a few likely reasons for what you are observing.

First, using a uniform distribution, you are not exhausting the whole range of floating-point values. [-50000, 50000] spans only about 63 of the 64 bits of a double-precision value, which alone should give a compression ratio above one (64/63, to be exact).

Second, the uniform distribution you are generating is uniform in the exact mathematical sense, but the floating-point values they map to roughly follow the reciprocal distribution, which is quite nonuniform. Hence, the random variates you are generating are not uniformly random in the domain of the binary floating-point representation, which is the domain fpzip operates on.
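One way to see this nonuniformity is to histogram the 11-bit exponent field of uniform variates; a sketch of my own (the specific bounds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 100_000.0, size=1_000_000)

# Extract the 11-bit biased exponent from each double's bit pattern.
exp_field = ((x.view(np.uint64) >> np.uint64(52)) & np.uint64(0x7FF)).astype(np.int64)
counts = np.bincount(exp_field, minlength=2048)

# Each binade [2^k, 2^(k+1)) below the top one holds half as many values
# as the one above it, so a few exponents carry nearly all the mass --
# the reciprocal-like shape described above.
top3 = counts.argsort()[::-1][:3]
print(top3, counts[top3] / len(x))
```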

The normal distribution, on the other hand, has infinite support and should (at least in theory) span the whole floating-point range (-infinity, +infinity). Also, this distribution is more closely aligned with the reciprocal distribution, which after conversion to floating-point presumably retains more entropy.

It may be instructive to evaluate the entropy of, say, the leading 16 bits of the floating-point representation of your random variates, though using a larger sample than 100,000 (one million at least), with the assumption that the remaining 48 bits are uniformly random.
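The suggested estimate might be computed as follows (my own Python, using one million samples as suggested; the bounds are illustrative). Under the stated assumption, an entropy estimate for the whole double is the leading-16-bit entropy plus 48.

```python
import numpy as np

def leading16_entropy(values):
    """Shannon entropy (bits) of the top 16 bits (sign, 11-bit exponent,
    leading 4 fraction bits) of each double's IEEE-754 representation."""
    words = (np.asarray(values, dtype=np.float64).view(np.uint64)
             >> np.uint64(48)).astype(np.int64)
    counts = np.bincount(words, minlength=1 << 16)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(7)
h_uniform = leading16_entropy(rng.uniform(-50_000.0, 50_000.0, 1_000_000))
h_normal = leading16_entropy(rng.normal(0.0, 1.0, 1_000_000))
print(h_uniform, h_normal)
```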

Note also that the sequence of values matters. For 1D data, fpzip predicts each value as the previous value and encodes their difference. It is not immediately clear to me how this might impact the resulting residual distribution, though it is easy enough to compute the distribution of the difference of two i.i.d. random variables using convolution. The resulting distribution should be closer to normal, per the central limit theorem.
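The convolution argument is easy to check empirically; a sketch of mine showing that consecutive differences of uniform variates follow a triangular distribution:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-0.5, 0.5, size=1_000_001)
d = np.diff(x)  # the residuals a previous-value predictor would code

# The difference of two i.i.d. uniforms on [-0.5, 0.5] is triangular on
# [-1, 1]: mass concentrates near zero instead of staying flat.
hist, _ = np.histogram(d, bins=[-1.0, -0.5, 0.0, 0.5, 1.0])
print(hist / len(d))  # roughly [0.125, 0.375, 0.375, 0.125]
```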

Finally, differential entropy is not the same as discrete entropy. You cannot use it (directly) to compare the "compressibility" of random variates drawn from continuous distributions. Again, a better estimate of the entropy can be made by examining the leading bits of the floating-point numbers being compressed. Hope that helps.

xanderdunn commented 3 years ago

This is hugely helpful, thank you! This helps my understanding of entropy, the floating point representation, and fpzip.

The distinction between the domain of the reals, which I am operating in, and the domain of the floating point representation, which fpzip is operating in, is an important one.

Is the idea of looking at the leading 16 bits that the remaining 48 bits of the fraction are likely not significant (uniformly random noise)?

To fully occupy the range of floating-point values, I created 1M doubles, each assembled from 64 bits taken from a quantum random number generator:

#!/usr/bin/env python3

import os

from tqdm import tqdm
import quantumrand
import bitstring
import numpy as np

def to_floats(values):
    bits = ""
    floats = []
    for value in values:
        bits += bitstring.BitArray(uint=value, length=16).bin
        if len(bits) == 64:
            result_float = bitstring.BitArray(bin=bits).float
            floats.append(result_float)
            bits = ""
    return floats

for i in tqdm(range(0, 1000 * 4)):
    values = []
    path = "./data/{}_quantum_random_numbers.npy".format(i)
    if not os.path.exists(path):
        values += quantumrand.get_data(data_type='uint16', array_length=1000)
        floats = to_floats(values)
        np_array = np.array(floats)
        np.save(path, np_array)

files = os.listdir("./data/")
result = np.array([], dtype=np.float64)
for path in files:
    result = np.concatenate((result, np.load("./data/{}".format(path))))
print(result)
print("Loaded {} files and got {} floats".format(len(files), result.shape))
np.save("./quantum_random_numbers_1M.npy", result)

And feeding these 1 million doubles into FPZip as a 1-d array we get:

quantumRandomCompressionRatio on 1,000,000 doubles: 0.9873771240024714

I believe this satisfies my quest to produce an ~1.0 compression ratio on random doubles.

Something to note with the above method of creating random doubles is that they may not be properly random in real space even if the bits are properly random in floating point space. For example, 467/1M values are nan, and I'm not sure we'd expect that many repeated values when randomly sampling the reals across (min double) to (max double).

I will additionally calculate some discrete entropies on the leading bits of 1M+ samples from my continuous variables to compare results.

xanderdunn commented 3 years ago

From your earlier suggestion, I sampled 1M points from a few distributions and took the first 16 bits of every Double that was generated. The x axis is the bin index; the number of bins equals the number of unique 16-bit words.

Normal distribution, discrete entropy 7.4652923094776895: [image]

Uniform from -500,000 to 500,000, discrete entropy 6.995889511155527: [image]

Uniform from -1e100 to 1e100, discrete entropy 6.970322375058382: [image]

The above histograms are created by finding the count of all unique 16-bit words, ordered from most frequently occurring to least frequently occurring. Is this what we're after? Typically the x axis of a histogram would be sorted by value rather than by frequency. I tried to set x to the value of the UInt16 represented by each bin's 16-bit binary string, but the values are too strongly clustered around ~22,000 and ~54,000 and the histogram isn't meaningful: [image]

I'm still wrapping my head around exactly what this tells us, as well as the relationship between the entropy of the mathematical values I'm observing and the entropy of the floating point representations of those values. But I believe what we see here is the reciprocal distribution you mentioned, and I believe we're also seeing the fact that the normal distribution retains more entropy after conversion to floating point.

lindstro commented 3 years ago

Is the idea of looking at the leading 16 bits that the remaining 48 bits of the fraction are likely not significant (uniformly random noise)?

Exactly. Assuming numbers are drawn from a continuous distribution that varies reasonably slowly, we'd expect the distribution to be locally "flat" when examining a small enough interval, so that the trailing bits are uniformly random. I don't know whether all 48 bits are fully random, but you can easily compute the entropy of individual bit positions to find out.
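Computing those per-bit-position entropies might look like this (a sketch of mine, using normal variates as an example):

```python
import numpy as np

def bit_position_entropy(values):
    """Entropy (bits) of each of the 64 bit positions, MSB (sign) first."""
    bits = np.asarray(values, dtype=np.float64).view(np.uint64)
    out = []
    for pos in range(63, -1, -1):
        p1 = float(((bits >> np.uint64(pos)) & np.uint64(1)).mean())
        if p1 in (0.0, 1.0):
            out.append(0.0)
        else:
            out.append(-p1 * np.log2(p1) - (1.0 - p1) * np.log2(1.0 - p1))
    return np.array(out)

rng = np.random.default_rng(9)
h = bit_position_entropy(rng.normal(0.0, 1.0, 100_000))
# The sign and trailing mantissa bits are near-fair coins (~1 bit each),
# while high exponent bits are strongly biased.
print(h[:4], h[-4:])
```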

To fully occupy the range of floating-point values, I created 1M floats where each float is created from 64 bits taken from the quantum random number generator:

I haven't had a chance to dig into what this code does exactly, but you could always generate random 64-bit integers uniformly and "reinterpret" them as doubles. I believe NumPy will let you do that on an array.
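That reinterpretation is a one-liner with NumPy's `view`; a sketch, under the assumption that pseudo-random bits suffice for the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw uniformly random 64-bit patterns and reinterpret them in place as
# IEEE-754 doubles -- no per-value bit assembly required.
raw = rng.integers(0, 2**64, size=1_000_000, dtype=np.uint64)
doubles = raw.view(np.float64)

# The view shares memory with `raw`; the bit patterns are unchanged.
print(doubles.dtype, int(np.isnan(doubles).sum()))
```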

Something to note with the above method of creating random doubles is that they may not be properly random in real space even if the bits are properly random in floating point space. For example, 467/1M values are nan, and I'm not sure we'd expect that many repeated values when randomly sampling the reals across (min double) to (max double).

You mean we shouldn't expect that many nans? Keep in mind that double precision supports about 2^53 = 9 quadrillion unique bit patterns for nans (how wasteful!), so we would expect about 1M * 2^53/2^64 = 488 nans, which is close enough to 467.
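Spelled out, the expected-count arithmetic (using the exact NaN pattern count, 2^53 - 2, rather than the rounded 2^53):

```python
# A double is NaN when its 11-bit biased exponent is all ones (0x7FF) and
# its 52-bit fraction is nonzero, for either sign bit.
nan_patterns = 2 * (2**52 - 1)
expected = 1_000_000 * nan_patterns / 2**64
print(round(expected, 1))  # about 488.3, consistent with the observed 467
```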

lindstro commented 3 years ago

The above histograms are created by finding the count of all unique 16-bit words ordered by count from most frequently occurring to least frequently occurring.

I'm not sure what to make of these histograms other than them clearly showing nonuniform distributions. To show the pmf, you would normally plot on the x axis the 16-bit string and on the y-axis the number of occurrences of that 16-bit string. If the distribution were uniform, it would be flat, but it's probably going to look a bit noisy due to the randomness of the sampling process. So you might first sort the bins, which perhaps is what you're suggesting. Either way, the computed entropy confirms my suspicion that the floating-point representation of the normally distributed random variate has higher entropy.

A more accurate estimate of fpzip compressibility would be to compute the entropy of (the leading floating-point bits of) differences of consecutive randomly drawn values.
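That estimate might be computed like so (a sketch of mine; the bounds are illustrative, and `np.diff` produces the residual stream a previous-value predictor sees):

```python
import numpy as np

def leading16_entropy(values):
    """Shannon entropy (bits) of the top 16 IEEE-754 bits of each double."""
    words = (np.asarray(values, dtype=np.float64).view(np.uint64)
             >> np.uint64(48)).astype(np.int64)
    counts = np.bincount(words, minlength=1 << 16)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(11)
x = rng.uniform(-50_000.0, 50_000.0, size=1_000_000)
d = np.diff(x)  # consecutive differences: the delta coder's residuals

# Compare the compressibility estimate for the raw values against the
# residuals fpzip actually encodes for 1D data.
print(leading16_entropy(x), leading16_entropy(d))
```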

xanderdunn commented 3 years ago

Thanks again for all your input!

generate random 64-bit integers uniformly and "reinterpret" them as doubles. I believe NumPy will let you do that on an array.

This is effectively what the code above is doing. The source of the random 64 bits is the quantum random process as opposed to numpy's pseudo-random int64 generator.

You mean we shouldn't expect that many nans? Keep in mind that double precision supports 2^53 = 9 trillion unique bit patterns for nans (how wasteful!), so we would expect 1M * 2^53/2^64 = 488 nans, which is close enough to 467.

That calculation makes sense, thanks. I was coming to the realization that we may expect this many NaNs in random floating-point space, even though it wouldn't make sense to expect so many duplicate values in a mathematically uniform distribution (on the reals rather than on the floats).

you would normally plot on the x axis the 16-bit string and on the y-axis the number of occurrences of that 16-bit string

When I convert the 16-bit strings to UInt16s and plot those on the x axis, I get this: [image] It's not useful because all the values are clustered around two poles: ~22,000 and ~54,000.

Here are the same graphs as above with the 16-bit strings on the x axis (at two different aspect ratios): Normal, 1M samples, 7.466690669068123 discrete entropy: [image] [image]

Uniform from -1e100 to 1e100, 1M samples, 6.970728151983884 discrete entropy: [image] [image]

But I don't know how matplotlib is sorting the strings and there are too many bins for the x axis to be legible.

Thanks for your help, I'll go ahead and close this out.