ehpor / hcipy

A framework for performing optical propagation simulations, meant for high contrast imaging, in Python.
https://hcipy.org
MIT License

Parallelizing pyramid modulation #112

Closed: rmjc closed this issue 1 year ago

rmjc commented 2 years ago

Hi all,

I'm using HCIPy to simulate an adaptive optics system for the TMT, and I'm finding that by far the longest step in my closed-loop simulation is the pyramid modulation (specifically the for loop in the forward function of the ModulatedPyramidWavefrontSensorOptics class). It should be possible to parallelize the operations in that for loop, since each modulation step is independent and doesn't need to be computed sequentially. Shortening this step would make a big difference in making HCIPy practical for high time resolution simulations!

I tried to implement a parallelized version of the for loop using the multiprocessing package, but I'm having trouble doing so because Python can't pickle PyramidWavefrontSensorOptics (I also get the error that the 'TipTiltMirror' object has no attribute 'is_regular', but I don't know what that means). As a result, I need to re-create the PYWFS and TTM objects in each of the parallel workers, which largely negates the speed improvement from parallelization.

If you folks can think of a practical way to make these objects picklable or otherwise parallelize / speed up the modulation, I would be very eager to use those capabilities.

Cheers,

Becky

ehpor commented 2 years ago

HCIPy objects are meant to be pickleable, but we've had cases in the past that slipped under the radar of the automated CI tests (see for example #92). I'll take a look at this optical element and why it doesn't pickle correctly. FYI, is_regular is an attribute of a Grid, so if a TipTiltMirror gets asked for that, something weird is going on, either during the pickling or during construction.

In the meantime, you could try using multiprocess (https://pypi.org/project/multiprocess/) instead of multiprocessing. It uses dill instead of the standard Python pickle module, and dill can serialize many more object types; in particular lambda functions, which hcipy uses in abundance, can be pickled with dill but not with pickle. It should be just a simple import replacement, since the API is the same as the standard library's.
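A minimal sketch of what that swap could look like, assuming your modulation loop mirrors the one in ModulatedPyramidWavefrontSensorOptics.forward(); the names pywfs, tip_tilt_mirror, wavefront and modulation_points are placeholders for whatever your simulation already defines:

```python
# Hedged sketch, not code from hcipy itself: only the import changes compared
# to multiprocessing. pywfs, tip_tilt_mirror, wavefront and modulation_points
# are hypothetical placeholders for your own objects.
from functools import partial

import multiprocess as mp  # pip install multiprocess; same API as multiprocessing, pickles with dill

def propagate_one(point, pywfs, tip_tilt_mirror, wavefront):
    # Apply a single modulation offset and propagate through the unmodulated PyWFS.
    tip_tilt_mirror.actuators = point
    return pywfs.forward(tip_tilt_mirror.forward(wavefront))

if __name__ == '__main__':
    worker = partial(propagate_one, pywfs=pywfs, tip_tilt_mirror=tip_tilt_mirror, wavefront=wavefront)

    with mp.Pool(processes=4) as pool:
        wf_modulated = pool.map(worker, modulation_points)
```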

Aside from this, I'm working on integrating GPU support into hcipy, which should be done over the next few months. That could potentially provide speedups of over 10x compared to the numpy / numexpr backend that hcipy uses now.

syhaffert commented 2 years ago

I am not sure that multiprocessing will speed up the simulations. I ran some tests using multiprocess, but the single-process approach was 3 to 4 times faster. The histograms in the figure below show the comparison.

[image: timing histograms for the single-process and multiprocess runs]

What I also see in my CPU usage statistics is that the single-process approach utilizes my CPUs at 100%, while the multiprocessing approach fluctuates between 70% and 98%. My guess is that most of the computation is done in the propagations, which are either FFTs or matrix-vector multiplications; both are already well optimized to use many cores efficiently. There may be a gain with the multiprocessing approach on a cluster with many CPU cores, but I only have 20 cores in my desktop, so I don't know how it scales to larger systems.

ehpor commented 2 years ago

@syhaffert To confirm, multiprocess has no problem pickling these optical elements?

Also, a general comment here. There is usually still a minor performance gain from multiprocessing, even in cases where a single process already uses the available CPU cores pretty effectively. Best performance is then often not reached with N processes on N cores, but rather with only 2-4 processes.

Furthermore, I presume you're including process startup overheads, which, given the 3-4x slower performance, are likely the vast majority of the measured time per frame. This also explains the lower CPU usage for multiprocess. To solve this, you can imagine keeping a process pool around, and reusing it for subsequent propagations, rather than starting a new one for each propagation.
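A rough sketch of that idea, reusing the hypothetical worker from the snippet above so the pool is started only once per closed-loop run:

```python
# Hedged sketch: keep one pool alive for the whole closed-loop run, so the
# process startup cost is paid once instead of once per frame. The worker and
# modulation_points arguments are the hypothetical placeholders from above.
import multiprocess as mp

def run_closed_loop(num_iterations, worker, modulation_points, num_processes=4):
    with mp.Pool(processes=num_processes) as pool:  # started once
        for iteration in range(num_iterations):
            # The same worker processes are reused for every frame.
            wf_modulated = pool.map(worker, modulation_points)

            # ... build the WFS image, reconstruct the wavefront and update the DM here ...
```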

syhaffert commented 2 years ago

Yes, multiprocess had no problems at all pickling the optical elements.

I can probably improve the performance by doing this in a smarter way; I will try it later. I thought it would be good to have the multiprocess version implemented as the standard approach for the modulated PWFS. Maybe we should even make a dedicated implementation instead of relying on the unmodulated PWFS, because we are currently doing a couple of redundant propagations. I think we can increase the speed by at least a factor of 2 by removing those.

ehpor commented 2 years ago

Good. Using multiprocess then solves the original problem in this issue. I'm gonna keep this open for a bit to discuss the speed improvements for PyramidWFS a little more.

@syhaffert Yes, moving the pyramid, rather than the PSF on top of a fixed pyramid, would be faster to simulate, since you can reuse the same PSF for each pyramid offset. You'd have to worry about discretization errors for the moving pyramid surface, though. Also, this gives at most a factor of 2, asymptotically for a large number of modulation positions. Unless you're thinking of something different than this.

Way more influential on the performance is this: the internal propagations inside the PyWFS use raw MFTs, but on very large grids. It's basically doing MFTs to calculate what is effectively an FFT, so that's for sure a big waste of computation time, think an order of magnitude. Hacking that in just for the PyWFS is easy, but I'd prefer to make it more general, since this is something that has been bothering me for the last few years.
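As a rough, standalone illustration of that point (grid sizes are arbitrary, and this just uses the public Fourier transform classes): when the output grid is exactly an FFT grid, an FFT and an MFT compute the same transform, but the FFT is far cheaper on large grids.

```python
# Illustrative only: time a FastFourierTransform against a MatrixFourierTransform
# on the same (FFT) output grid. Grid sizes are arbitrary.
import time

import numpy as np
from hcipy import (FastFourierTransform, Field, MatrixFourierTransform,
                   make_fft_grid, make_pupil_grid)

pupil_grid = make_pupil_grid(512)
focal_grid = make_fft_grid(pupil_grid, q=2)

field = Field(np.random.randn(pupil_grid.size), pupil_grid)

transforms = {
    'FFT': FastFourierTransform(pupil_grid, q=2),
    'MFT': MatrixFourierTransform(pupil_grid, focal_grid),
}

for name, ft in transforms.items():
    start = time.perf_counter()
    ft.forward(field)
    print(f'{name}: {time.perf_counter() - start:.3f} s')
```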

@rmjc As an intermediate quick fix that'll give big speed improvements with only a minor change to the system you're simulating: you can add a spatial filter on the pyramid by setting num_airy to the radius of the filter (in lam/D). By default, the whole field of view is taken into account, meaning num_pix / 2 in lam/D. Computation time scales with the square of num_airy, so this has a big influence on performance. You might be able to live with a spatial filter of half the default value without much influence on the WFS image, giving you 4x the performance, or even more if you're willing to reduce it further.
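For example, something like this sketch; the exact constructor signature depends on your hcipy version, and the numbers are only illustrative:

```python
# Hedged sketch of adding a spatial filter via num_airy. Check the
# PyramidWavefrontSensorOptics signature of your hcipy version; only the
# num_airy keyword is the point here.
from hcipy import PyramidWavefrontSensorOptics, make_pupil_grid

pupil_grid = make_pupil_grid(240)

# The default is the full field of view (num_pix / 2 = 120 lam/D here).
# Halving it to 60 lam/D should give roughly 4x faster internal propagations.
pywfs = PyramidWavefrontSensorOptics(pupil_grid, num_airy=60)
```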

syhaffert commented 2 years ago

The Fraunhofer propagator does make a rough estimate of which method is fastest, right? Or will it just select the MFT approach because I ask for a (potentially) non-integer q?

ehpor commented 2 years ago

It will always use an MFT, because you are supplying focal grids. There used to be a shortcut in the early days (before open source) where you supplied a (q, num_airy) tuple instead; in that case, the make_fourier_transform() function that FraunhoferPropagator uses would choose the best Fourier transform method. I removed that option because it caused some confusion among early users (including you, I seem to remember). So right now, it always gets actual grids, and will therefore never choose an FFT and always return an MFT.

The solution that I'm currently considering is to have a grid history. The grid history records a list of the operations that made the grid what it is now. So make_focal_grid() would return a grid that has a [scale, fft_grid] history. The Fraunhofer propagator would then scale that grid, creating a [scale, scale, fft_grid] grid (or better yet, the scale method on the grid would squash the two scale history items together into a single [scale, fft_grid] entry, or would even detect that the total scaling factor is one and remove the scale history item altogether). The make_fourier_transform() function would then detect that your input is just an fft grid, so it has the option to return an FFT to do the Fourier transform in case that's faster.

This feels too convoluted to be the right solution, but I haven't found a better one in the last two years, so I might have to go there, even though I don't like it.

syhaffert commented 2 years ago

Wouldn't a solution be to add a 'use_fft' parameter to the Fraunhofer propagator? Then, if it's set to True, you don't need to supply the focal grid and it will choose the FFT-based propagator.

ehpor commented 2 years ago

> Wouldn't a solution be to add a 'use_fft' parameter to the Fraunhofer propagator? Then, if it's set to True, you don't need to supply the focal grid and it will choose the FFT-based propagator.

That would be very manual, which means that very few people would actually benefit from it. I've pushed a PR for autodetecting FFTs, which seems to work okay: a ~25-50% speed improvement for the modulated PyWFS, depending on the number of pixels across the pupil.

Discard everything I said above about grid histories, btw. I didn't realize that make_focal_grid() stopped depending on make_fft_grid() about two years ago, which makes the whole grid history idea bogus. I've found a better solution, which required only a few hours of tinkering to get working. Plus, it made me discover the first bug in the Fourier transform subpackage since the open-source release of hcipy. 😄

rmjc commented 2 years ago

@ehpor I wanted to check in on the status of the GPU support -- I would be interested in using this capability!

syhaffert commented 2 years ago

> @syhaffert Yes, moving the pyramid, rather than the PSF on top of a fixed pyramid, would be faster to simulate, since you can reuse the same PSF for each pyramid offset. You'd have to worry about discretization errors for the moving pyramid surface, though. Also, this gives at most a factor of 2, asymptotically for a large number of modulation positions. Unless you're thinking of something different than this.

I have implemented this solution and it gives a major speed boost, more than a factor of 2! It is also more stable in computation time; the variance/std decreased by a large amount. See the attached histogram comparing the two approaches.

[image: timing histogram comparing the old and new modulation approaches]

Here is my implementation of the new forward method. I could use this new method to close the loop with a reconstruction matrix that was made with the old method, which probably means there is not much difference between the two approaches. One thing that probably also helps is the use of a list comprehension to go through all the modulation points.

```python
def forward(self, wavefront):
    '''Propagates a wavefront through the modulated pyramid wavefront sensor.

    Parameters
    ----------
    wavefront : Wavefront
        The input wavefront that will propagate through the system.

    Returns
    -------
    wf_modulated : list
        A list of wavefronts for each modulation position.
    '''
    wf_modulated = []

    if self._new_method:
        # New method: propagate to the focal plane once, then apply each
        # (pre-shifted) pyramid apodizer and propagate back to the pupil.
        wf_foc = self.pyramid_wavefront_sensor.spatial_filter.forward(self.pupil_to_focal.forward(wavefront))
        wf_modulated = [self.focal_to_pupil.forward(pyramid.forward(wf_foc)) for pyramid in self._pyramid_apodizers]
    else:
        # Old method: apply each tip-tilt offset and propagate through the
        # full unmodulated pyramid wavefront sensor every time.
        for point in self.modulation_positions.points:
            self.tip_tilt_mirror.actuators = point
            modulated_wavefront = self.tip_tilt_mirror.forward(wavefront)

            wf_modulated.append(self.pyramid_wavefront_sensor.forward(modulated_wavefront))

    return wf_modulated
```

rmjc commented 2 years ago

Thanks for sharing this implementation @syhaffert! Can you also share where you defined pyramid_apodizers?