Closed xanderdunn closed 3 years ago
I realized that it may be throwing out the sign bit for all-positive values. Indeed, when I adjust the uniform distribution so that it's centered at 0 from [-50000, 50000] instead of [0, 100000], the compression ratio is approximately equal to that of the normal distribution:
uniform centered at 0 compression ratio: 1.0571481051280933
normal compression ratio: 1.0633603249629153
The only remaining question is, to what do we attribute the remaining ~6% gain in compression on the uniformly random data?
There are a few likely reasons for what you are observing.
First, using a uniform distribution, you are not exhausting the whole range of floating-point values. [-50000, 50000] spans just around 63 of 64 double-precision bits, which alone should give a compression ratio above one (64/63, to be exact).
Second, the uniform distribution you are generating is uniform in the exact mathematical sense, but the floating-point values you are mapping them to roughly follow the reciprocal distribution, which is quite nonuniform. Hence, the random variates you are generating are not uniformly random in the domain associated with the binary floating-point representation, which is the domain that fpzip operates on.
The normal distribution, on the other hand, has infinite support and should (at least in theory) span the whole floating-point range (-infinity, +infinity). Also, this distribution is more closely aligned with the reciprocal distribution, which after conversion to floating-point presumably retains more entropy.
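To make the reciprocal-distribution point concrete, here is a small sketch (not from the thread's code; NumPy-based, with an arbitrary seed and sample size) that tallies the biased exponent field of uniformly distributed doubles. The counts are far from flat: each step down in exponent roughly halves the count, since it halves the length of the covered interval.

```python
import numpy as np

# Sketch: uniformly distributed reals do not fill the exponent field uniformly.
rng = np.random.default_rng(0)                 # arbitrary seed
x = rng.uniform(-50000.0, 50000.0, size=1_000_000)

exponents = (x.view(np.uint64) >> 52) & 0x7FF  # 11-bit biased exponent field
vals, counts = np.unique(exponents, return_counts=True)
for v, c in zip(vals[-6:], counts[-6:]):
    # unbiased exponent = v - 1023; halving the magnitude steps the exponent
    # down by one and (roughly) halves the count as well
    print(int(v) - 1023, c)
```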
It may be instructive to evaluate the entropy of, say, the leading 16 bits of the floating-point representation of your random variates, though using a larger sample than 100,000 (one million at least), with the assumption that the remaining 48 bits are uniformly random.
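As a sketch of what that measurement could look like (my own code, not from the thread; NumPy-based, arbitrary seed, and the uniform range is the one used earlier):

```python
import numpy as np

# Sketch: discrete entropy, in bits, of the leading 16 bits of the IEEE-754
# double representation of each sample.
def leading16_entropy(x):
    words = x.view(np.uint64) >> 48            # sign + exponent + top 4 mantissa bits
    _, counts = np.unique(words, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)                 # arbitrary seed
e_norm = leading16_entropy(rng.normal(size=1_000_000))
e_unif = leading16_entropy(rng.uniform(-50000, 50000, size=1_000_000))
print(e_norm, e_unif)
```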
Note also that the sequence of values matters. For 1D data, fpzip predicts each value as the previous value and encodes their difference. It is not immediately clear to me how this might impact the resulting residual distribution, though it is easy enough to compute the distribution of the difference of two i.i.d. random variables using convolution. The resulting distribution should be closer to normal according to the central limit theorem.
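The convolution point can be sketched numerically (a toy discretization, not fpzip's actual residual coder): for two i.i.d. uniform variates, the pdf of their difference comes out triangular, i.e. peaked at zero.

```python
import numpy as np

# Toy discretization: pmf of X - Y for i.i.d. X, Y uniform on [0, 1] via
# convolution; the result is the triangular distribution on [-1, 1].
n = 1000
p = np.full(n, 1.0 / n)             # discretized uniform pmf on [0, 1]
diff_pdf = np.convolve(p, p[::-1])  # pmf of the difference X - Y

print(diff_pdf.argmax())            # peak at the center bin, i.e. a difference of 0
```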
Finally, differential entropy is not the same as discrete entropy. You cannot use it (directly) to compare the "compressibility" of random variates drawn from continuous distributions. Again, a better estimate of the entropy can be made by examining the leading bits of the floating-point numbers being compressed. Hope that helps.
This is hugely helpful, thank you! This helps my understanding of entropy, the floating point representation, and fpzip.
The distinction between the domain of the reals, which I am operating in, and the domain of the floating point representation, which fpzip is operating in, is an important one.
Is the idea of looking at the leading 16 bits that the remaining 48 bits of the fraction are likely not significant (uniformly random noise)?
To fully occupy the range of floating-point values, I created 1M floats where each float is created from 64 bits taken from the quantum random number generator:
```python
#!/usr/bin/env python3
import os
from tqdm import tqdm
import quantumrand
import bitstring
import numpy as np

def to_floats(values):
    """Pack 16-bit integers into 64-bit strings and reinterpret them as doubles."""
    bits = ""
    floats = []
    for value in values:
        bits += bitstring.BitArray(uint=value, length=16).bin
        if len(bits) == 64:
            result_float = bitstring.BitArray(bin=bits).float
            floats.append(result_float)
            bits = ""
    return floats

# Each file holds 1000 uint16s -> 250 doubles; 4000 files -> 1M doubles.
for i in tqdm(range(0, 1000 * 4)):
    values = []
    path = "./data/{}_quantum_random_numbers.npy".format(i)
    if not os.path.exists(path):
        values += quantumrand.get_data(data_type='uint16', array_length=1000)
        floats = to_floats(values)
        np_array = np.array(floats)
        np.save(path, np_array)

files = os.listdir("./data/")
result = np.array([], dtype=np.float64)
for path in files:
    result = np.concatenate((result, np.load("./data/{}".format(path))))
print(result)
print("Loaded {} files and got {} floats".format(len(files), result.shape))
np.save("./quantum_random_numbers_1M.npy", result)
```
And feeding these 1 million doubles into FPZip as a 1-d array we get:
quantumRandomCompressionRatio on 1,000,000 doubles: 0.9873771240024714
I believe this satisfies my quest to produce an ~1.0 compression ratio on random doubles.
Something to note with the above method of creating random doubles is that they may not be properly random in real space even if the bits are properly random in floating point space. For example, 467/1M values are nan, and I'm not sure we'd expect that many repeated values when randomly sampling the reals across (min double) to (max double).
I will additionally calculate some discrete entropies on the leading bits of 1M+ samples from my continuous variables to compare results.
From your earlier suggestion, I sampled 1M points from a few distributions and took the first 16 bits of every Double that was generated. x axis is bin count = number of unique elements.
Normal distribution, Discrete entropy: 7.4652923094776895:
Uniform from -500,000 to 500,000, Discrete entropy: 6.995889511155527:
Uniform from -1e100 to 1e100, Discrete entropy: 6.970322375058382:
The above histograms are created by finding the count of all unique 16-bit words, ordered from most frequently occurring to least frequently occurring. Is this what we're after? Typically the x axis of a histogram is sorted by value rather than by frequency. I tried to set x to the value of the UInt64 represented by each bin's 16-bit binary string, but the values are too strongly clustered around ~22,000 and ~54,000 and the histogram isn't meaningful:
I'm still wrapping my head around exactly what this tells us, as well as the relationship between the entropy of the mathematical values I'm observing and the entropy of the floating point representations of those values. But I believe what we see here is the reciprocal distribution you mentioned, and I believe we're also seeing the fact that the normal distribution retains more entropy after conversion to floating point.
> Is the idea of looking at the leading 16 bits that the remaining 48 bits of the fraction are likely not significant (uniformly random noise)?
Exactly. Assuming numbers are drawn from a continuous distribution that varies reasonably slowly, we'd expect the distribution to be locally "flat" when examining a small enough interval, so that the trailing bits are uniformly random. I don't know whether all 48 bits are fully random, but you can easily compute the entropy of individual bit positions to find out.
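A sketch of that per-bit check (my own code, not from the thread; NumPy-based, arbitrary seed):

```python
import numpy as np

# Sketch: entropy of each of the 64 bit positions, to check which trailing
# bits are effectively uniform (entropy close to 1 bit).
def bit_entropies(x):
    bits = x.view(np.uint64)
    ents = []
    for pos in range(63, -1, -1):              # MSB (sign bit) first
        p1 = ((bits >> pos) & 1).mean()
        ents.append(0.0 if p1 in (0.0, 1.0)
                    else -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1))
    return np.array(ents)

rng = np.random.default_rng(0)                 # arbitrary seed
e = bit_entropies(rng.normal(size=1_000_000))
print(e[:12])   # sign + leading exponent bits: structured, entropy below 1
print(e[-8:])   # trailing mantissa bits: close to 1 bit each
```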
> To fully occupy the range of floating-point values, I created 1M floats where each float is created from 64 bits taken from the quantum random number generator:
I haven't had a chance to dig into what this code does exactly, but you could always generate random 64-bit integers uniformly and "reinterpret" them as doubles. I believe NumPy will let you do that on an array.
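Something along these lines (a sketch of the suggested reinterpretation; the seed is arbitrary):

```python
import numpy as np

# Sketch: draw uniform 64-bit words and view the same buffer as doubles
# (a type reinterpretation, not a value conversion).
rng = np.random.default_rng(0)                 # arbitrary seed
words = rng.integers(0, 2**64, size=1_000_000, dtype=np.uint64)
doubles = words.view(np.float64)

# With fully random bit patterns, roughly 1e6 * 2^53 / 2^64 ~ 488 nans appear.
print(np.isnan(doubles).sum())
```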
> Something to note with the above method of creating random doubles is that they may not be properly random in real space even if the bits are properly random in floating point space. For example, 467/1M values are nan, and I'm not sure we'd expect that many repeated values when randomly sampling the reals across (min double) to (max double).
You mean we shouldn't expect that many nans? Keep in mind that double precision supports 2^53 ≈ 9 quadrillion unique bit patterns for nans (how wasteful!), so we would expect 1M * 2^53/2^64 ≈ 488 nans, which is close enough to 467.
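The arithmetic, spelled out:

```python
# A double is nan when its 11 exponent bits are all ones and its 52 mantissa
# bits are not all zero, for either value of the sign bit.
nan_patterns = 2 * (2**52 - 1)                 # ~2^53 distinct nan encodings
expected = 1_000_000 * nan_patterns / 2**64
print(expected)                                # ~488.28, close to the observed 467
```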
> The above histograms are created by finding the count of all unique 16-bit words ordered by count from most frequently occurring to least frequently occurring.
I'm not sure what to make of these histograms other than them clearly showing nonuniform distributions. To show the pmf, you would normally plot on the x axis the 16-bit string and on the y-axis the number of occurrences of that 16-bit string. If the distribution were uniform, it would be flat, but it's probably going to look a bit noisy due to the randomness of the sampling process. So you might first sort the bins, which perhaps is what you're suggesting. Either way, the computed entropy confirms my suspicion that the floating-point representation of the normally distributed random variate has lower entropy.
A more accurate estimate of fpzip compressibility would be to compute the entropy of (the leading floating-point bits of) differences of consecutive randomly drawn values.
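A sketch of that comparison (my own code; the uniform range is the one used earlier in the thread, and the seed is arbitrary):

```python
import numpy as np

# Sketch: entropy of the leading 16 bits of the values themselves vs. of
# consecutive differences (fpzip's 1D residual is the difference from the
# previous value).
def leading16_entropy(x):
    _, counts = np.unique(x.view(np.uint64) >> 48, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)                 # arbitrary seed
x = rng.uniform(-50000.0, 50000.0, size=1_000_001)
e_vals = leading16_entropy(x)
e_diff = leading16_entropy(np.diff(x))
print(e_vals, e_diff)
```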
Thanks again for all your input!
> generate random 64-bit integers uniformly and "reinterpret" them as doubles. I believe NumPy will let you do that on an array.
This is effectively what the code above is doing. The source of the random 64 bits is the quantum random process as opposed to NumPy's pseudorandom int64 generator.
> You mean we shouldn't expect that many nans? Keep in mind that double precision supports 2^53 ≈ 9 quadrillion unique bit patterns for nans (how wasteful!), so we would expect 1M * 2^53/2^64 ≈ 488 nans, which is close enough to 467.
That calculation makes sense, thanks. I was coming to the realization that we should expect this many nans in random floating-point space, even though it wouldn't make sense to expect so many duplicate values in a mathematically uniform distribution (on the reals rather than on the floats).
> you would normally plot on the x axis the 16-bit string and on the y-axis the number of occurrences of that 16-bit string
When I convert the 16-bit strings to UInt16s and plot those on the x axis, I get this: it's not useful because all the values are clustered around two poles, ~22,000 and ~54,000.
Here are the same graphs as above with the 16 bit strings on the x axis (at two different aspect ratios): Normal, 1M samples, 7.466690669068123 discrete entropy:
Uniform from -1e100 to 1e100, 1M samples, 6.970728151983884 discrete entropy:
But I don't know how matplotlib is sorting the strings and there are too many bins for the x axis to be legible.
Thanks for your help, I'll go ahead and close this out.
Many thanks for providing these excellent compressors as open source software! I do not believe this is a bug report, but rather an inquiry into the characteristics of the compression algorithm and how to interpret the compression ratio.
I have made a very simple experiment to better understand the characteristics of fpzip's compression algorithm:
Please excuse the fact that the code is written in Swift. I wrote a Swift->C interface to call fpzip from Swift, but the same should be readily achievable directly in C. This code generates random data from a uniform distribution and from a normal distribution, compresses each independently as 1-dimensional arrays, and then reports the compression ratios. It produces this output:
Two questions arise:
Swift's `Double.random`, TensorFlow's `ThreefryRandomNumberGenerator`, as well as downloading truly random values produced by quantum physical processes, all gave me a compression ratio of ~1.1625541314267445. I would hypothesize that perhaps fpzip's minimum compression ratio on floats is 1.16 no matter how incompressible the data are, due to some improvement over the floating-point representation, but that doesn't hold up against the compression ratio of the normally distributed data:

```python
from scipy import stats

print(stats.uniform(loc=0.0, scale=100000).entropy())  # uniform distribution from 0 to 100,000
print(stats.norm(loc=0.0).entropy())                   # std-deviation 1.0 normal distribution
```

```
11.512925464970229
1.4189385332046727
```