GalSim-developers / GalSim

The modular galaxy image simulation toolkit. Documentation:
http://galsim-developers.github.io/GalSim/

Efficiency in convolution #875

Closed brgillis closed 5 years ago

brgillis commented 7 years ago

A colleague of mine is working on a shear estimation algorithm that uses a model-fitting approach, and efficiency is a major concern: he needs each galaxy model to be generated in ~10 ms. He can do this with his own code, but for the Euclid pipeline we want everyone to be using the same galaxy models if possible, which means using the GalSim implementations of the profiles. The problem is that a few efficiency measures in his code are not currently possible with GalSim (at least as far as I can tell). These are:

There might be something I'm missing in GalSim that can help with this, so I'll wait for input before talking about my thoughts on implementing these in GalSim in case it's unnecessary.

rmjarvis commented 7 years ago

Before tackling specific improvements, the best bet would be to develop a self-contained example code that you want to optimize. Then we can actually profile it and see what the tall poles are for optimization. I'm always happy to add efficiency improvements, but without a target use case to profile, my experience is that one tends to focus on things that aren't actually helpful.

Aside: for very fast model fitting using GalSim, my guess is that you will be better off doing the modeling in k-space rather than real space. Then you don't need any FFTs. For simple models, GalSim draws the k-space image in <~ 1ms typically. cf. https://github.com/GalSim-developers/GalSim/issues/873#issuecomment-287159168. I.e. FFT your data once and generate models in k-space, rather than repeatedly FFT the models to compare to the data.
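To make that concrete, here's a toy sketch in plain numpy (not GalSim code; the Gaussian profile and every name here are just illustrative) of the idea: FFT the data once up front, then evaluate each trial model analytically in k-space, so the fitting loop contains no FFTs at all:

```python
import numpy as np

n = 64
x = np.fft.fftfreq(n) * n            # pixel coords [0, 1, ..., -2, -1], centered at 0
xx, yy = np.meshgrid(x, x)
true_sigma = 3.0
data = np.exp(-0.5 * (xx**2 + yy**2) / true_sigma**2)  # "observed" galaxy image

data_k = np.fft.fft2(data)           # FFT the data ONCE
data_k /= data_k[0, 0]               # normalize to unit flux

kx = 2 * np.pi * np.fft.fftfreq(n)   # k in radians per pixel
kxx, kyy = np.meshgrid(kx, kx)
ksq = kxx**2 + kyy**2

def model_k(sigma):
    # Analytic k-space profile of a unit-flux Gaussian: no FFT needed per model
    return np.exp(-0.5 * ksq * sigma**2)

# Scan over trial sigmas entirely in Fourier space
sigmas = np.linspace(1.0, 5.0, 41)
chisq = [np.sum(np.abs(data_k - model_k(s))**2) for s in sigmas]
best_sigma = sigmas[int(np.argmin(chisq))]
```

A real fit would of course use noise weights and a proper minimizer, but the structure is the point: the per-model cost is just evaluating an analytic function on a k-grid.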

brgillis commented 7 years ago

Here's some sample code for it, comparable to the model generation step of what my colleague (Giuseppe) is doing (not included is the actual comparison of the generated model to the actual image):

import galsim
from cProfile import runctx

def main():

    # Using very low accuracy GSParams here for speed
    gsparams = galsim.GSParams(minimum_fft_size=256,
                               folding_threshold=0.1,
                               kvalue_accuracy=1e-3,
                               stepk_minimum_hlr=2.5,)

    # Note - we actually use an interpolated image instead; just putting this in
    # so you can run the code without needing that file
    psf_prof = galsim.OpticalPSF(lam=725, # nm
                                 diam=1.2, # m
                                 defocus=0,
                                 obscuration=0.33,
                                 nstruts=3,
                                 gsparams=gsparams)

    convolved_image = galsim.Image(256, 256, scale=0.02)

    for _i in range(1000):

        gal_prof = galsim.Sersic(n=4, half_light_radius=0.3, gsparams=gsparams)
        convolved_prof = galsim.Convolve(gal_prof, psf_prof, gsparams=gsparams)

        # Using method='no_pixel' here since we plan to use a PSF profile
        # that already includes the pixel response
        convolved_prof.drawImage(convolved_image, method='no_pixel')

if __name__ == "__main__":
    runctx("main()",{},{"main":main},filename="convolve_time_test.prof")

The runtime of this is about 9.5ms per galaxy (according to the profiling data, looking just at the drawFFT method). The profiles and GSParams here result in using 256x256 arrays in k-space, which matches what Giuseppe uses, and his runtime is closer to 1ms per galaxy.

Unfortunately my profiler stops at the Python/C++ interface, so I can't tell what's taking up most of the time here. Comparing to Giuseppe's code, the extra array allocations and use of FFTW_ESTIMATE instead of FFTW_MEASURE were the biggest differences I noticed.

rmjarvis commented 7 years ago

This is why you should profile. The FFT is taking 0.9% of the time here. So fftw tweaks aren't going to help at all. Nor are the array copies significant.

The biggest cost is the k-space interpolation of the interpolated image. We use quintic interpolation in k-space for interpolated images, which Bernstein & Gruen (2014) found is necessary to avoid significant errors in the shapes of the drawn images. (I'm guessing Giuseppe doesn't do that, so that's probably the biggest time difference between the two codes.) I don't think you want to mess with that if you're using this code for shear estimation.

However, we could speed things up a lot by doing this step only once for the PSF, rather than every time, since it doesn't change between model draws. There's no easy way to do this with the current GalSim calls, but you could roll your own custom drawFFT function that does a single setup step to make the initial k-space image, draws the PSF and pixel components onto it, and saves that. Then for each model, drawK() onto another image with the same size and bounds, multiply by the PSF/pixel image, and finish with the wrap/rfft calls. Since your galaxy is analytic, those steps should be extremely fast.
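In numpy terms, the caching idea looks roughly like this (a sketch only, with stand-in Gaussian profiles; the actual drawFFT internals differ):

```python
import numpy as np

n = 256
# k-grid matching numpy's rfft2 layout: full frequency axis x half-plane axis
ky = 2 * np.pi * np.fft.fftfreq(n)[:, None]
kx = 2 * np.pi * np.fft.rfftfreq(n)[None, :]
ksq = kx**2 + ky**2                  # shape (256, 129)

# Stand-in PSF k-image (Gaussian, sigma = 1.5 px).  In the real use case this
# is the expensive part (interpolated-image PSF), so it is computed ONCE,
# outside the model-fitting loop, and cached.
psf_k = np.exp(-0.5 * ksq * 1.5**2)

def draw_model(sigma):
    # Per-model work: evaluate the analytic galaxy in k, multiply by the
    # cached PSF k-image, and do a single inverse real FFT.
    gal_k = np.exp(-0.5 * ksq * sigma**2)
    return np.fft.irfft2(gal_k * psf_k, s=(n, n))

img = draw_model(3.0)                # one cheap draw per trial model
```

The per-model loop then contains no interpolation of the PSF at all, only an analytic evaluation and one inverse FFT.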

brgillis commented 7 years ago

Ahh, okay, that's useful. I was actually profiling the code, but I hadn't set it up to go beyond the Python layer. EDIT: Figured out how to do it with valgrind, and I see that now too.

In any case, yes, the interpolation only needs to be done once for each PSF profile, so that should indeed speed things up a lot for repeated runs like this. EDIT: I suspect this is a common use case, so if it is possible to find a way to do this, it will likely benefit more than just me.

rmjarvis commented 7 years ago

If you're on a Mac, Instruments isn't too bad at getting into the C++ layer. That's what I usually use.

rmjarvis commented 7 years ago

I refactored the drawFFT function a bit to make it easier to separate out the PSF and galaxy k images. Here is what your above script looks like with the PSF part pulled out of the loop. (This is on branch #873, since I was already doing other efficiency optimizations there.)

When I run it now, it takes about 2 ms per loop.

$ python time_conv.py 
         30423 function calls in 2.291 seconds

   Ordered by: internal time
   List reduced from 130 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.954    0.001    1.034    0.001 /Users/Mike/lib/python2.7/site-packages/galsim/gsobject.py:1496(drawFFT_finish)
     1001    0.940    0.001    0.953    0.001 /Users/Mike/lib/python2.7/site-packages/galsim/gsobject.py:1957(_drawKImage)
        1    0.174    0.174    0.174    0.174 /Users/Mike/lib/python2.7/site-packages/galsim/gsobject.py:282(maxK)
        1    0.084    0.084    2.291    2.291 time_conv.py:6(main)
     1002    0.065    0.000    0.065    0.000 {method 'reduce' of 'numpy.ufunc' objects}
     2001    0.023    0.000    0.023    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/image.py:1095(_setZero)
     1001    0.012    0.000    0.015    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/base.py:1025(__init__)
     1002    0.008    0.000    0.012    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/transform.py:254(_Transform)
     2009    0.004    0.000    0.004    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/image.py:461(bounds)
     1003    0.004    0.000    0.006    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/bounds.py:194(_BoundsI)
     1000    0.003    0.000    0.006    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/image.py:970(center)
     2012    0.003    0.000    0.005    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/gsobject.py:209(__init__)
     2011    0.003    0.000    0.003    0.000 {built-in method __new__ of type object at 0x10b2e8788}
     4069    0.003    0.000    0.003    0.000 {isinstance}
     2020    0.002    0.000    0.002    0.000 {hasattr}
     1003    0.002    0.000    0.003    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/image.py:473(scale)
     1001    0.002    0.000    0.067    0.000 {method 'sum' of 'numpy.ndarray' objects}
     1000    0.002    0.000    0.014    0.000 /Users/Mike/lib/python2.7/site-packages/galsim/gsobject.py:858(_shift)
        2    0.001    0.001    0.002    0.001 /Users/Mike/lib/python2.7/site-packages/galsim/interpolatedimage.py:256(__init__)
     1002    0.001    0.000    0.066    0.000 /Users/Mike/Library/Python/2.7/lib/python/site-packages/numpy/core/_methods.py:31(_sum)

About 1 second is drawing the k-image, which involves a lot of accessing the lookup table we use for the Sersic Hankel transform. The code for that is pretty generic and hasn't been super optimized, so there are probably some gains to be had there.

The other second is mostly the fft. I was wrong before about it only taking 0.9%. That was the time for the inverse_fft function that in turn calls fftw. (With the refactoring, it's up to 1.7%.) I didn't realize that my profiler isn't assigning the time in the fftw library to this function as it should. There are lots of orphan functions that fall outside the main code tree that seem to be from the fftw library. They probably add up to the majority of that 0.954 seconds cProfile marks as drawFFT_finish.

I tried switching to FFTW_MEASURE rather than FFTW_ESTIMATE, and it actually got slightly slower for 1000 iterations (about 4%). At 10,000 iterations it was slightly faster (about 1.5%). So maybe there's a slight gain to be had there for certain use cases, but probably not much.

If you'd like to try out further optimizations, please feel free. I think the tall pole right now is improving the calculation of the Sersic Hankel transformation. (cf. #566) If we could figure out a nice fitting formula, that could probably be made a lot faster. Short of that, a custom lookup table specifically geared to this function might be faster than what we have now.
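As a sketch of the lookup-table idea (none of this is GalSim code; an exponential disk is used as a stand-in because its Hankel transform has the closed form (1 + k^2)^(-3/2), which lets us check the table's accuracy against the exact answer):

```python
import numpy as np

def exact_kprofile(k):
    # Closed-form Hankel transform of a unit-scale exponential disk
    return (1.0 + k**2) ** -1.5

# Build the table ONCE on a log-spaced k grid.  For Sersic n=4 this is the
# step that would actually be expensive (a numerical Hankel integral per node).
logk_grid = np.linspace(-3, 2, 512)          # k from 1e-3 to 1e2
table = exact_kprofile(10.0**logk_grid)

def fast_kprofile(k):
    # Per-query cost: one log10 plus one linear interpolation
    return np.interp(np.log10(k), logk_grid, table)

# Check the table against the exact profile over the range we'd actually use
k = np.linspace(0.01, 50.0, 10000)
err = np.max(np.abs(fast_kprofile(k) - exact_kprofile(k)))
print(err)  # worst-case interpolation error over the sampled range
```

A table tuned to one specific function like this can afford a coarser grid or a cheaper interpolant than a fully generic lookup-table class, which is where the speedup would come from.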

brgillis commented 7 years ago

Thanks for looking into this - those changes should be quite helpful.

The FFTW_MEASURE change doesn't look worth it from your numbers, but that could easily be system dependent (if someone is on a system where the estimate is particularly bad, it could still be useful). To take advantage of it, we would probably need to store and load wisdom across runs. I imagine an interface like:

galsim.use_fftw_wisdom('my_fftw_wisdom.dat') # warn but don't throw an exception if it doesn't exist
...work...
galsim.save_fftw_wisdom('my_fftw_wisdom.dat')

would work best, where FFTW_MEASURE is used if use_fftw_wisdom is called (even if the file doesn't exist), and FFTW_ESTIMATE otherwise. If this sounds good, I can look into adding it - I'm already doing something similar in a different project, so this shouldn't be too hard for me.

rmjarvis commented 7 years ago

That API sounds fine to me.

brgillis commented 7 years ago

Alright, I'll get to it once I have some time.

rmjarvis commented 5 years ago

I think the conclusion of this thread is that FFTW_MEASURE wasn't helpful, and I assume that in the past two years you have lost interest in adding it. Therefore, I'm closing this now, but please feel free to reopen if you feel inspired to work on it, or if you find a use case where FFTW_MEASURE makes a significant difference.

brgillis commented 5 years ago

Ah yeah, sorry for forgetting to update this. It turned out we couldn't make GalSim efficient enough (compared to a streamlined function for specific profiles) for the switch to FFTW_MEASURE to be worth it for that use case.

There are other parts of the pipeline that might be sped up a bit by this, but they're far from the slowest parts, so for now the effort is better spent elsewhere in the code. So yes, I'm fine leaving this closed until/unless it becomes worthwhile to look at again.