Make correlated noise generation more efficient

It has been pointed out by @gbernstein that the generation of correlated noise within GalSim could be made more efficient by exploiting Hermitian symmetry. This should be relatively straightforward to implement.

First up, I want to apologise in advance for what is going to be a longish series of posts... Getting @gbernstein 's innocuous-sounding suggestion to work (i.e. using Hermitian symmetry to make GalSim's correlated noise generation more efficient) has taken me pretty much 4 full days of head scratching and various tests and rollbacks on this branch. At the heart of the problem was a subtlety that I found almost delicious, or at least I did once I (finally) realised what the problem was.

A quick summary of how noise generation works in BaseCorrelatedNoise (which I'll shorthand as CN) instances:

You provide an image and as the CN to add noise to it (either according to the correlated noise instance's own correlation function, or CF, or designed to whiten noise with this CF - the operations are very similar)
The CN instance queries an internal cache to see if it has an already-generated array containing the square root of the power spectrum (labelled rootps) corresponding to this CF: if not, it generates one from the CF via an FFT; but if it has one already in the cache it uses it, saving on the FFT.
This rootps is then multiplied by a same-sized array realization of a complex Gaussian random variable z = x + i y where each of x and y are sampled from the standard normal distribution, and the result is inverse FFT-ed back, and the real part taken, to give a real realization with the desired power spectrum.

(There is an extra step or two in there for whitening, but this is not totally relevant to understanding the issue tackled on this branch.)

Once @gbernstein suggested it, I thought "of course!" We need only ever generate a half sized rootps (or rather, an array of length rootps.shape[1]/2 + 1 in its trailing dimension, using integer division rules) and then generate a Hermitian symmetric noise field using the fact that we know we want only a real final realization. This means we need generate fewer Gaussian random vector elements, by a factor of ~2, and also save a similar factor on the FFTs required by using numpy.fft.irfft2 and numpy.fft.rfft.

The problem I found was that, when going to higher precision than standard checks in the test suite of test_correlatednoise.py, it didn't work.

The most egregious example of failure, and ultimately the most telling, was in the test_convolve_cosmos() test. This is pretty full-on end-to-end test, it takes a COSMOS correlated noise instance and generates noise according to it. It then reads this noise into an InterpolatedImage instance, and convolves the noise field with an asymmetric, reconvolution algorithm-style PSF (i.e. a ground based PSF and a deconvolved COSMOS PSF) to create a smoothed field. The statistics of this smoothed field are compared to the expected correlation function if you take the COSMOS CN and internally convolve it with the same reconvolution algorithm-style PSF via the CN's .convolveWith() method. This comparison is made for many realizations of the noise field, and will typically break if one among any number of parts of the correlated_noise.py calculations are screwed up.

When ramping the precision of this test up to 2dp (which requires 500-1000 realizations to pass on master) the new rfft/irfft failed, and going to higher precision still did not help.

Here is an image of what the convolved, COSMOS noise field's CF is supposed to look like (the reference in the test): cf_ref

This is what is generated by master for the comparison image, made with 1000 realizations of the noise field convolved within an InterpolatedImage instance: cf_pass_master

...And here is what you get from the irfft2, Hermitian-symmetric noise realization method: cf_fail_old It looks close but there is a noticeable deficit at the far LHS and RHS, and I saw this elsewhere too.

Here is the difference image of the Hermitian minus reference: cf_fail-ref

And here is that, on the same scale, for the old master comparison: cf_pass_master-ref

Basically, it became apparent that I had broken something, and I spend two days isolating changes in the codebase to work out what. I spotted a couple of silly errors along the way, but none of them helped this pretty glaring problem.

Will make a new post as this is already long...

I could not, for the life of me, determine what I'd managed to screw up, and I tried everything I could think of: differences between even and odd sized arrays, some problem with an overall normalization factor given an asymmetric (non unitary) division by 1/N in the NumPy DFT conventions... None of these paid off.

So I went away and tried a few code experiments elsewhere. These can be found in devel/external/test_cf/test_noise_gen_irfft.py.

I started in 1D, and constructed a simple Gaussian power spectrum. I made many realizations of this using both the slower ifft-then-take-real-part and the more efficient, Hermitian irfft methods, then looked at the average of estimates of the power spectrum (PS) from each of these realizations. The slightly clunky code is below:

import numpy as np

# Define sizes of arrays, input parameters
sigma = 1.7            # our noise field is going to have a Gaussian CF & PS for simplicity
nu = 21                # number of array elements
u = np.fft.fftfreq(nu) # get the u0, u1, u2 etc. frequency values
nsamples = 50000       # number of realizations to average over
ps = np.exp(-2. * np.pi**2 * sigma**2 * u**2) # using result for FT of a Gaussian being another one
                                              # (note scale factors in this might be wrong, done
                                              # from memory, but not crucial for demonstration)

# 1D plots first
print "Generating 1D plots..."
psests = []
psests_r = []
for i in range(nsamples):

    realization = np.fft.ifft((np.random.randn(nu) + 1j * np.random.randn(nu)) * np.sqrt(ps)).real
    realization_irfft = np.fft.irfft(
        (np.random.randn(nu//2+1) + 1j * np.random.randn(nu//2+1)) * np.sqrt(.5 * ps[:nu//2+1]), nu)
    psests.append(np.abs(np.fft.fft(realization))**2)
    psests_r.append(np.abs(np.fft.fft(realization_irfft))**2)

psest = np.mean(np.array(psests), axis=0)
psest_r = np.mean(np.array(psests_r), axis=0)

Below is a plot of the reference power spectrum, the power spectrum estimated from the ifft realizations, and that generated from the irfft realizations. ps_irfft This plot, which I arrived at yesterday, was the key to my understanding what was going wrong. You can see that there is a clear deficit in power at u=0 in the irfft generated realizations. It is low by a factor of 0.5 exactly (or at least seemed to converge as I increased the number of realizations sampled).

You have probably already worked out what is going wrong...

The argument of irfft, which we can label a, is a compact representation of an assumed Hermitian symmetric array. This means that a[0] should be strictly real!

However, what we are supplying as a in our example, namely (np.random.randn(nu//2+1) + 1j * np.random.randn(nu//2+1)) * np.sqrt(.5 * ps[:nu//2+1]), nu), has an imaginary part at a[0] that is non-zero in general. NumPy silently ignores the imaginary part, which means we have a 50% power deficit at the u=0 mode.

This impacts the correlation function globally (since it was totally local to u=0 in the PS), as can be seen below: cf_irfft (note I haven't corrected the usual wrapped-round way storing these arrays, sorry!)

The analysis extends to 2D, which of course is the case we are interested in. Putting in a 2D Gaussian PS, here is what is estimated from ifft-generated realizations: psxy_fft ...and here is what you see in irfft-generated realizations: psxy_rfft Funky. Here is the difference: psxy_rfft-fft

The correlation functions in 2D are also affected: Here is the CF from 1000 realizations of ifft2-generated noise (here I have rolled the centre of the CF to the image centre): cfxy_fft

In contrast, the CF from the irfft2-generated noise immediately shows hints of something comparable to the GalSim test_correlatednoise.py results: cfxy_rfft The difference between these two CFs is striking: cfxy_rfft-fft Although a much simpler set up, I think this reproduces, in a toy model form with less noise, the sort of errors I was seeing in the Hermitian minus reference case of the tests in test_correlatednoise.py.

On to the fix...

The fix is simple, I think: because we are missing 50% of the power from all the modes in the PS with ux = 0, I simply multiply PS[:, 0] by two before taking its square root and multiplying it by the arrays of Gaussian random numbers.

There might be something I'm missing that means this is wrong, but it seems to work. In the 2D toy model presented above, the difference between the CFs from the two different generation techniques becomes the following: cfxy_rfft_fixed-fft ...dominated by noise and much smaller than before.

This fix is now in GalSim on this branch and all tests pass, including the convolve_cosmos test at 2dp agreement and using 1000 realizations (takes 200s on my laptop). The move to using irfft for noise generation has made modest speed improvements in the tests, but for cases where realizations of a noise field are repeatedly generated for postage stamps of the same dimensions (e.g. in GREAT3-like simulations) the efficiency savings will be greater.

For a comparison of speedups, here is the output of running python test_correlatednoise.py on master on my laptop (with the 2dp, 1000 realizations version of test_convolve_cosmos()):

browe$ python test_correlatednoise.py
time for test_uncorrelated_noise_zero_lag = 0.21
time for test_uncorrelated_noise_nonzero_lag = 0.20
time for test_uncorrelated_noise_symmetry_90degree_rotation = 0.01
time for test_xcorr_noise_basics_symmetry_90degree_rotation = 0.11
time for test_ycorr_noise_basics_symmetry_90degree_rotation = 0.10
time for test_arbitrary_rotation = 0.00
time for test_scaling = 0.00
time for test_jacobian = 0.00
time for test_draw = 0.00
time for test_output_generation_basic = 0.10
time for test_output_generation_rotated = 0.24
time for test_output_generation_magnified = 0.19
time for test_copy = 0.00
time for test_cosmos_and_whitening = 1.86
Mean square estimate from avg. of individual field mean squares = 0.58898877435
Mean square estimate from all fields = 0.58898877435
Ratio of mean squares = 1.000000e+00
Variance estimate from avg. of individual field variances = 0.550753723842
Variance estimate from all fields = 0.588936918754
Ratio of variances = 9.351659e-01
Zero lag CF from avg. of individual field CFs = 0.58898877435
Zero lag CF in reference case = 0.594331973533
Ratio of zero lag CFs = 9.910097e-01
Printing analysis of central 4x4 of CF:
mean diff =  -0.00662609561158
var diff =  1.1339501178e-06
min diff =  -0.00937850650732
max diff =  -0.00398401330772
mean ratio = 9.812375e-01
var ratio =  6.88179990133e-05
min ratio = 9.444555e-01
max ratio = 9.910097e-01
time for test_convolve_cosmos = 186.77
new_int_im.noise =  <galsim.correlatednoise._BaseCorrelatedNoise object at 0x10763d788>
new_int_im.noise =>  <galsim.correlatednoise._BaseCorrelatedNoise object at 0x10763dfc8>
time for test_uncorrelated_noise_tracking = 0.54
time for test_variance_changes = 0.00

For comparison, the same thing now run on this branch:

 browe$ python test_correlatednoise.py
time for test_uncorrelated_noise_zero_lag = 0.17
time for test_uncorrelated_noise_nonzero_lag = 0.16
time for test_uncorrelated_noise_symmetry_90degree_rotation = 0.01
time for test_xcorr_noise_basics_symmetry_90degree_rotation = 0.09
time for test_ycorr_noise_basics_symmetry_90degree_rotation = 0.08
time for test_arbitrary_rotation = 0.00
time for test_scaling = 0.00
time for test_jacobian = 0.00
time for test_draw = 0.00
time for test_output_generation_basic = 0.07
time for test_output_generation_rotated = 0.18
time for test_output_generation_magnified = 0.14
time for test_copy = 0.00
time for test_cosmos_and_whitening = 1.41
Mean square estimate from avg. of individual field mean squares = 0.603267070413
Mean square estimate from all fields = 0.603267070413
Ratio of mean squares = 1.000000e+00
Variance estimate from avg. of individual field variances = 0.563222043875
Variance estimate from all fields = 0.603246521578
Ratio of variances = 9.336515e-01
Zero lag CF from avg. of individual field CFs = 0.603267070413
Zero lag CF in reference case = 0.594331973533
Ratio of zero lag CFs = 1.015034e+00
Printing analysis of central 4x4 of CF:
mean diff =  0.00693726181114
var diff =  2.64527087626e-06
min diff =  0.00161822397996
max diff =  0.00919982014963
mean ratio = 1.017930e+00
var ratio =  4.76231580484e-06
min ratio = 1.009619e+00
max ratio = 1.023194e+00
time for test_convolve_cosmos = 184.73
new_int_im.noise =  <galsim.correlatednoise._BaseCorrelatedNoise object at 0x107640788>
new_int_im.noise =>  <galsim.correlatednoise._BaseCorrelatedNoise object at 0x107640fc8>
time for test_uncorrelated_noise_tracking = 0.48
time for test_variance_changes = 0.00

Small but noticeable speedups in a number of tests.

I have put the test_correlatednoise.py script back into standard, lower precision, "regression" mode, although I did need to increment the number of realizations by one in the 1dp test_convolve_cosmos() test to get it to pass (as the number of RNGs being generated in the test process has changed it's not surprising that this might have happened). But anyway I'm happy with doing that as I know in the current code this test passes at higher precision if you up the number of realizations drastically.

I think this branch is ready to PR, but the Issue page seemed like the best place to dump this tale of woe and frustration :) This whole exercise took me about a factor of five times longer than I had wanted, so apologies for that...

@gbernstein , your suggestion sounded so innocent!

(And apologies for the long posts, with hindsight this should all have been bloody obvious, but I wanted to get it off my chest here on the Issues page and make a good record of where this mysterious factor comes from...)

Well Barney, my apologies for triggering this journey of annoying enlightenment. I'm a little surprised that it's all the values with ux=0 that are affected. I would have thought that just the four points with ux, uy = 0 or N/2 are self-conjugate and therefore have no imaginary part and need double the power in the real part. Also depending on how Python treats the half-plane of data that is needed as input to irfft2, there could be funny business with the points at [1:N/2,0] because they should be the conjugates of the points at [N/2+1:N,0]. Likewise [1:N/2,N/2] is conjugate to [N/2+1:N,N/2].

A quick experiment with irfft2 suggests that the routine does not ignore either of these parts of the array that are supposed to be conjugates. It seems to behave as if it averages the positive-frequency and (conjugate) of the negative-frequency values to produce the "true" positive frequency value (in other words it projects away any anti-Hermitian components). Do you fill in the negative half-lines with the conjugates of the positive ones, or do you draw fresh random values for them? If you're doing the latter, then there may be a need for a factor-2 change of power on these half-lines.

Well Barney, my apologies for triggering this journey of annoying enlightenment. I was tearing my hair out, I admit, but no apology necessary. I have learned something!

I'm a little surprised that it's all the values with ux=0 that are affected. I would have thought that just the four points with ux, uy = 0 or N/2 are self-conjugate and therefore have no imaginary part and need double the power in the real part.

My reasoning was partly empirical, in that I couldn't find any other explanation for the following plot, which is the difference between the power spectra from the irfft and ifft-realized noise fields (this time plotted for an even sized array, but otherwise very similar to the example in the comments above): psxy_rfft-fft There is a big deficit along that line, and it appears to be one half the input power. I believe what makes the line x=0 special is that the x dimension is the trailing array dimension, and it is only the trailing array dimension in the multidimensional NumPy rfft/irfft that is put into the halfcomplex format. But I need to have a bit more head-scratching time, I think...

I would have thought that just the four points with ux, uy = 0 or N/2 are self-conjugate and therefore have no imaginary part and need double the power in the real part.

Ah, you make a good point about the N/2 element for even sized arrays, I need to check that too...

Also depending on how Python treats the half-plane of data that is needed as input to irfft2, there could be funny business with the points at [1:N/2,0] because they should be the conjugates of the points at [N/2+1:N,0]. Likewise [1:N/2,N/2] is conjugate to [N/2+1:N,N/2].

Yes, those could be affected. I wonder if a flat (or nearly flat) power spectrum in the toy model script in devel would help bring these out better, I will have a play...

A quick experiment with irfft2 suggests that the routine does not ignore either of these parts of the array that are supposed to be conjugates. It seems to behave as if it averages the positive-frequency and (conjugate) of the negative-frequency values to produce the "true" positive frequency value (in other words it projects away any anti-Hermitian components).

OK, right, that makes sense: it is akin to doing the full complex inverse DFT and discarding the resulting imaginary part.

Do you fill in the negative half-lines with the conjugates of the positive ones, or do you draw fresh random values for them? If you're doing the latter, then there may be a need for a factor-2 change of power on these half-lines.

Interesting. I was doing the latter, since I simply copied over the code from the former, but this wasn't joined up thinking. There were wholly independent random numbers both above and below the line uy=0. Now, relative to the old case, I did have an additional power factor of 2 overall so I think this is why the tests passed, but my rational for that was somewhat wooly (it was because < |N(0,1) + i N(0,1)|**2> is sqrt(2) ).

In short, I am still screwing up my generation of the halfcomplex format array, since it is not Hermitian symmetric across the line uy=0. This might explain why the multiplication by 2 above works - it is cancelling one error with another! I'll look into it... (Have to run soon, it's our astro department summer BBQ...)

It makes sense that the sqrt(2) on the half-line was compensating for having independent random numbers at the two conjugate sides.

If I understand this (no guarantee of this!) a quick way to patch this all up would be:

Fill all the components of the k array with random complex numbers just like in the good old days except you're only doing the half-plane. Then fix the Hermitian redundancies via:

k[N-1:N/2:-1, 0] = conj(k[1:N/2,0] ) k[N-1:N/2:-1, N/2] = conj(k[1:N/2,N/2]) k[0,0] = sqrt(2) * k[0,0].real k[0,N/2] = sqrt(2) * k[0,N/2].real k[N/2,0] = sqrt(2) * k[N/2,0].real k[N/2,N/2] = sqrt(2) * k[N/2,N/2].real

This sounds very similar to what I ran into trying to move the lensing_ps stuff to use rfft. See PowerSpectrumRealizer._make_hermitian(). There are also similar things in the C++ layer when we fill in the k-space values, we need to make sure the boundaries come out correctly. cf. SBProfile::SBProfileImpl::fillKGrid() in SBProfile.cpp.

It's always tricky to handle these conventions correctly. And my guess is that the swirly pattern still evident in this plot means that something still isn't quite right. Probably the N/2 stuff that Gary mentioned.

That "this plot" link didn't work for me, are you referring to the one from the toy model where it was supposed to have worked?

Sorry. I copied the wrong link. I fixed it above, but yes, that's the plot I meant. It seems like there is still structure there.

Many thanks again @gbernstein & @rmjarvis for the help on this, I spent most of my spare time this week drawing diagrams of (small) odd and even sized Hermitian arrays to make sure I understood this and had correctly implemented it. I arrived at the following, ending in a prescription which is very similar to the one you suggested Gary (note usual integer division rules and Pythonic array slicing notation applies):

We have a complex k array as described above in Gary's comment, filled as usual with independent samples from a Gaussian Random Variate in each of the real and imaginary parts.
k has shape (N, M/2+1). We are going to use k to generate a realization of a real-valued noise field, with shape (N, M).
To do this we need to make sure both that k matches the form of a Hermitian array arr[i, j] = numpy.conj(arr[-i,-j]) that has been stored in halfcomplex compact format, and that the variance of the Hermitian part of each element of k is the same as that of all the other elements. As Gary neatly described it above the non-Hermitian part is "projected" out in the inverse real-DFT, so it is the Hermitian part that must have the correct variance.
Whether N or M are odd or even, we apply the following operations to k:

k[-1:N/2:-1, 0] = numpy.conj(k[1:(N+1)/2, 0])
k[0, 0] = numpy.sqrt(2.) * k[0, 0].real

Then, if M is even valued, we do the following:

k[-1:N/2:-1, M/2] = numpy.conj(k[1:(N+1)/2, M/2])
k[0, M/2] = numpy.sqrt(2.) * k[0, M/2].real

If N is even valued we do:

k[N/2, 0] = numpy.sqrt(2.) * k[N/2, 0].real

If both N and M are even valued, we finally do:

k[N/2, M/2] = numpy.sqrt(2.) * k[N/2, M/2].real

This prescription leads to the right results at as high precision as I have tested, which is getting fairly high.

The test script in devel/ has been modified to use this fix, and also modified to provide a flatter power spectrum so as to better test the edge values at N/2 and M/2. Here is the reference power spectrum for the 22x22 array test (usual lower corner centring): psxy_ref_nu22_n10000000 I ran tests making up to 1e7 realizations of noise according to this power, using the irfft prescription above. Here is the resulting best estimate of the PS: psxy_rfft_gary_fixed_nu22_n10000000 You can't see the differences by eye, so here is the residual: psxy_rfft_gary_fixed-ref_nu22_n10000000

Here is the recovered PS and the residual from a similar test (1e7 realizations) on a 21x21 matrix: psxy_rfft_gary_fixed_nu21_n10000000 psxy_rfft_gary_fixed-ref_nu21_n10000000

I think these are working, and the current state of the test script in devel/external/test_cf/ allows you to rerun these tests. At a slightly earlier commit (42c1cff for ref) I made tests where the x and y dimensions were alternately odd/even (but never both odd or both even) to test the switching above. Here are residual PS results from a 21x22 test, 1e6 realizations: psxy_rfft_gary_fixed-ref_nu121_nu222_n1000000 Same for 22x21 test image: psxy_rfft_gary_fixed-ref_nu122_nu223_n1000000

This prescription is also now implemented in the correlatednoise.py code, and passes the tests there both at the lower and higher level of accuracy. I think this is ready for a PR, but comments welcome.

Those look very nicely random now. Aside from the requisite 180 degree symmetry, that is. Looks to me like it's working.

I will just say, that the earlier plot (the one you linked to) might have been partly patchy in nature because of the longer correlation length in the noise begin generated. But we do know there were other issues too.

But yes, I do agree that I think these look to be working, possibly working period, but almost certainly at sufficient accuracy for our needs in the correlated noise generation application...

GalSim-developers / GalSim

Make correlated noise generation more efficient #563