Possible speed ups for aiapy.calibrate.register() #63

Closed nabobalis closed 1 year ago

In GitLab by @cbard on Sep 1, 2020, 07:57

tl;dr Easiest acceleration (3-4x vs. orig) is to wait for contains_full_disk improvement in upstream sunpy. Recommended to use OpenCV affine transform (8-9x acceleration vs. orig). Recommended to test registering raw data and then creating Sunpy Map (instead of currently creating map and then registering map data).

As part of the Helio Hackweek 2020 AIApy project, Raphael Attie (@WaaallEEE) and I looked at aiapy.calibrate.register to see how it could be sped up. We wrote some basic replacements for register and benchmarked them. To summarize:

cfd_register: Jack Ireland changed sunpy.map.maputils.contains_full_disk to test only the edge pixels and not every pixel. This is an ongoing pull request in sunpy, but the original version made the overall register 3-4x faster. (This technically has nothing to do with the actual affine transform).
cupy_register: Drop in a one-to-one CuPy replacement for scipy.ndimage.affine_transform (CAVEAT: only supports linear interpolation at the moment). This is about 6-9x faster than register.
cv_register: Implement R. Attie's version of register using OpenCV (cv2) python library and merge it with the current sunpy meta handling. This is about 6-9x faster than register.

So, it looks like upgrading to the openCV affine transform is the ideal path forward. This may change in a year or two, as further improvements to CuPy (like adding cubic interpolation to affine_transform) are made.

Raphael also broke down register into its individual components for timing, namely the metadata overhead versus the actual affine transformation. His conclusion:

using SunPy maps to update the metadata adds 2000% bottleneck with respect to the fastest AIA data prepping, which is only an interpolation using an Affine Transform. While these accelerated interpolations offer up to ~10x speedup (OpenCV) with respect to the original method, it would be premature optimization to accelerate the Affine Transform & interpolation method any further as long as the Sunpy overhead (~2 seconds) is not dealt with.

register_timing_table (table courtesy of R. Attie)

Ultimately, we conclude that it may make more sense to do a data prep pipeline without the Sunpy maps; i.e. loading the data, preparing it, and then creating the finished Map. The added abstraction prevents vectorization of the data prep which can be leveraged for acceleration. This data load could be done by using astropy.io.fits, which is already included in the sunpy dependencies.

In GitLab by @markcheung on Sep 1, 2020, 11:04

Thanks @cbard, Jack and Raphael. This is really great.

seems like a no-brainer and I hope it gets merged to sunpy soon.
A fair comparison would be linear interpolation on the CPU. Is there a comparable speedup?
AFAIK, OpenCV is a well-maintained package with lots of support and developers. If Sunpy devs don't mind depending on a large external package, we wouldn't mind using it for aia_register (contingent on verification of results).

In GitLab by @cbard on Sep 1, 2020, 12:53

A quick comparison on my laptop using %timeit gives:

type	time
original 3rd/1st	12/11 s
new disk check (3rd/1st)	2.8/2.03s
openCV (3rd/1st)	1.53/1.53s

The first two rows are using the skimage.warp, not the scipy.ndimage.affine_transform (which adds another second to runtime).

Raphael, on his desktop, did compare 1st vs. 3rd order interpolations with openCV. He got similar results for both: 625-650 ms (compare with 5 seconds for the original 3rd order register).

So, yes, I think it is a comparable speedup. OpenCV does look like the fastest implementation of affine_transform (in any order), but getting the full benefits will require changing how register deals with map metadata and uses the native sunpy functions.

In GitLab by @wtbarnes on Sep 3, 2020, 08:00

I think it would be interesting to first see whether the OpenCV affine transform can be used as a "backend" (in the same way we use scikit-image or scipy) for the affine transform in sunpy. I would really not like this package to be in the business of defining its own image manipulation routines unless the problem is very specific to AIA images (which an affine warping is not).

The way I see it, this package should view Map as a fundamental data type. Functions should ingest Maps and return Maps and only operate on images outside of the Map container unless absolutely necessary. If we are spending a lot of time doing some generic (not specific to AIA) operation on the bare array, I think that in and of itself is a good indication that that functionality should be offloaded to a lower level package.

I say all of this not to discourage efforts to optimize this process, but just to make sure what we're doing is in the scope of this package.

In GitLab by @markcheung on Sep 3, 2020, 08:48

I parse Will’s comments as encouragement for the Sunpy team to optimize some routines commonly used in workflows. I guess what Jack did is a start.

I have used astropy.io.fits to load AIA images too. It probably is worth being an example in the aiapy gallery. Now, if registering can be 5 or 10 times faster without Sunpy Map, we should offer it as an alternate workflow.

I think two principles we can keep in mind if we were to offer an alternate workflow is:

make sure WCS keywords at level 1.5 remain the same,
Abs diff (pixel-wise) < some threshold.

In GitLab by @cbard on Sep 3, 2020, 10:01

I agree with what you both have said. Some points of clarification, and details of specifics:

I'm not sure if we've shown that register is 5-10x faster without sunpy.Map since all of the updated register functions technically use sunpy.Map. Rather, we've shown that register can be 5-10x faster if 1) we streamline the overhead in register and sunpy.map.rotate, and 2) we add the OpenCV version of warpAffine upstream in sunpy.map.rotate.

It just turns out that, since the CuPy and OpenCV transforms promise execution times of ~100-200 ms (versus 1-3 seconds for skimage/scipy.ndimage), the bottleneck would no longer be the affine transform, but rather the overhead prior to calling it. From Raphael's timing breakdown, the overall overhead (before and during register) includes a few hundred ms for loading through sunpy.Map, ~2 seconds for update_pointing, and 100 ms for using a temporary map to do the pixel coordinate calculations within map.rotate. I think any source of possible improvement would start with looking at these functions and either accelerating them or keeping only what's necessary for AIA_prep.

In GitLab by @Cadair on Sep 9, 2020, 11:55

@wtbarnes brought this issue to my attention today, and I think that while I fully expect there to be thing in SunPy which can be made faster I want to raise some points about the methodology needed when benchmarking code and also the most constructive ways of moving forward.

First however, a generic point. SunPy is very open to people coming to us and saying "this is slow, can it be made faster". We are (as always) even more open to people coming to us with a code contribution saying "here I made this thing faster"!

So I think to ascertain what needs to be done a methodical bit of benchmarking needs to be done. It's very important to compare like for like. I am also coming at this strictly from a "how can sunpy be made faster" perspective rather than a "can we make a thing in python that does register faster than aiapy".

So the first question to ask is why is it slow.

I am going to start by benchmarking the following script, and then modify it a little as we need to:

import sunpy.data.sample
import sunpy.map

from aiapy.calibrate import register

aia_map = sunpy.map.Map(sunpy.data.sample.AIA_171_ROLL_IMAGE)

register(aia_map)

running this using the wonderful pyinstrument package gives the following output, which I will hide as it's very long and then come back to:

``` 7.275 bench_aia_simple.py:1 ├─ 5.851 register aiapy/calibrate/prep.py:14 │ ├─ 3.731 contains_full_disk sunpy/map/maputils.py:127 │ │ ├─ 3.441 is_all_off_disk sunpy/map/maputils.py:202 │ │ │ ├─ 2.481 all_coordinates_from_map sunpy/map/maputils.py:38 │ │ │ │ └─ 2.416 wrapper astropy/units/decorators.py:178 │ │ │ │ └─ 2.416 pixel_to_world sunpy/map/mapbase.py:1074 │ │ │ │ ├─ 1.959 wcs_pix2world astropy/wcs/wcs.py:1415 │ │ │ │ │ └─ 1.959 _array_converter astropy/wcs/wcs.py:1269 │ │ │ │ │ └─ 1.959 _return_list_of_arrays astropy/wcs/wcs.py:1275 │ │ │ │ │ └─ 1.903 astropy/wcs/wcs.py:1419 │ │ │ │ └─ 0.356 __init__ astropy/coordinates/sky_coordinate.py:267 │ │ │ │ ├─ 0.178 _parse_coordinate_data astropy/coordinates/sky_coordinate_parsers.py:216 │ │ │ │ │ └─ 0.144 __new__ astropy/coordinates/angles.py:612 │ │ │ │ │ └─ 0.125 wrap_angle astropy/coordinates/angles.py:656 │ │ │ │ │ └─ 0.125 _wrap_internal astropy/coordinates/angles.py:630 │ │ │ │ │ └─ 0.124 [self] │ │ │ │ └─ 0.177 __init__ sunpy/coordinates/frames.py:118 │ │ │ │ └─ 0.176 __init__ astropy/coordinates/baseframe.py:415 │ │ │ │ └─ 0.175 __init__ astropy/coordinates/representation.py:1542 │ │ │ │ └─ 0.175 __init__ astropy/coordinates/representation.py:681 │ │ │ │ └─ 0.175 __init__ astropy/coordinates/representation.py:212 │ │ │ │ └─ 0.139 astropy/coordinates/representation.py:260 │ │ │ │ └─ 0.124 __new__ astropy/coordinates/angles.py:612 │ │ │ │ └─ 0.124 wrap_angle astropy/coordinates/angles.py:656 │ │ │ │ └─ 0.122 _wrap_internal astropy/coordinates/angles.py:630 │ │ │ └─ 0.956 wrapper astropy/units/decorators.py:178 │ │ │ └─ 0.956 coordinate_is_on_solar_disk sunpy/map/maputils.py:173 │ │ │ └─ 0.853 __getattr__ astropy/coordinates/sky_coordinate.py:758 │ │ │ └─ 0.853 __getattr__ astropy/coordinates/baseframe.py:1620 │ │ │ └─ 0.853 represent_as sunpy/coordinates/frames.py:133 │ │ │ ├─ 0.532 wrap_angle astropy/coordinates/angles.py:656 │ │ │ │ └─ 0.529 _wrap_internal astropy/coordinates/angles.py:630 │ │ │ │ └─ 0.514 [self] │ │ │ └─ 0.321 represent_as astropy/coordinates/baseframe.py:995 │ │ │ ├─ 0.141 __init__ astropy/coordinates/representation.py:1826 │ │ │ │ └─ 0.141 __init__ astropy/coordinates/representation.py:681 │ │ │ │ └─ 0.141 __init__ astropy/coordinates/representation.py:212 │ │ │ │ └─ 0.140 astropy/coordinates/representation.py:260 │ │ │ │ └─ 0.125 __new__ astropy/coordinates/angles.py:612 │ │ │ │ └─ 0.125 wrap_angle astropy/coordinates/angles.py:656 │ │ │ │ └─ 0.123 _wrap_internal astropy/coordinates/angles.py:630 │ │ │ └─ 0.141 represent_as astropy/coordinates/representation.py:1603 │ │ │ └─ 0.141 __init__ astropy/coordinates/representation.py:1826 │ │ │ └─ 0.141 __init__ astropy/coordinates/representation.py:681 │ │ │ └─ 0.141 __init__ astropy/coordinates/representation.py:212 │ │ │ └─ 0.140 astropy/coordinates/representation.py:260 │ │ │ └─ 0.124 __new__ astropy/coordinates/angles.py:612 │ │ │ └─ 0.123 wrap_angle astropy/coordinates/angles.py:656 │ │ │ └─ 0.122 _wrap_internal astropy/coordinates/angles.py:630 │ │ └─ 0.082 sunpy/map/maputils.py:160 │ │ └─ 0.081 __getitem__ astropy/units/quantity.py:1037 │ ├─ 1.981 wrapper astropy/units/decorators.py:178 │ │ └─ 1.981 rotate sunpy/map/mapbase.py:1212 │ │ └─ 1.810 affine_transform sunpy/image/transform.py:14 │ │ ├─ 1.543 warp skimage/transform/_warps.py:666 │ │ │ [9 frames hidden] skimage, numpy │ │ ├─ 0.111 _showwarnmsg warnings.py:96 │ │ │ └─ 0.111 _showwarning astropy/logger.py:170 │ │ │ └─ 0.107 warning logging/__init__.py:1436 │ │ │ [5 frames hidden] logging │ │ │ 0.107 handle logging/__init__.py:937 │ │ │ └─ 0.107 emit astropy/logger.py:528 │ │ │ └─ 0.106 color_print astropy/utils/console.py:325 │ │ │ └─ 0.106 _get_stdout astropy/utils/console.py:103 │ │ │ └─ 0.106 __get__ astropy/utils/decorators.py:669 │ │ │ └─ 0.106 OutStream astropy/utils/console.py:52 │ │ │ └─ 0.089 __get__ astropy/utils/decorators.py:669 │ │ │ └─ 0.089 get_ipython astropy/utils/console.py:44 │ │ │ └─ 0.089 IPython/__init__.py:2 │ │ │ [327 frames hidden] IPython, prompt_toolkit, collections,... │ │ └─ 0.088 skimage/transform/__init__.py:1 │ │ [143 frames hidden] skimage, scipy, inspect, numpy, colle... │ └─ 0.121 inner_f sunpy/util/decorators.py:307 │ └─ 0.121 wrapper astropy/units/decorators.py:178 │ └─ 0.121 submap sunpy/map/mapbase.py:1397 ├─ 0.484 sunpy/data/__init__.py:1 │ ├─ 0.200 __init__ sunpy/data/data_manager/cache.py:35 │ │ └─ 0.200 __new__ astropy/time/core.py:382 │ │ └─ 0.199 update_leap_seconds astropy/time/core.py:2569 │ │ └─ 0.195 astropy/utils/iers/__init__.py:2 │ │ └─ 0.195 astropy/utils/iers/iers.py:2 │ │ └─ 0.194 astropy/table/__init__.py:3 │ │ └─ 0.109 astropy/io/fits/__init__.py:3 │ │ └─ 0.098 astropy/io/fits/convenience.py:3 │ │ └─ 0.098 astropy/io/fits/diff.py:2 │ │ └─ 0.097 astropy/io/fits/hdu/__init__.py:3 │ │ └─ 0.095 astropy/io/fits/hdu/compressed.py:3 │ │ └─ 0.094 astropy/io/fits/hdu/image.py:3 │ └─ 0.156 sunpy/data/data_manager/__init__.py:1 │ └─ 0.154 sunpy/data/data_manager/cache.py:1 │ └─ 0.150 sunpy/time/__init__.py:1 │ └─ 0.150 sunpy/time/time.py:1 │ └─ 0.150 pandas/__init__.py:3 │ [473 frames hidden] pandas, matplotlib, inspect, re, sre_... ├─ 0.360 __call__ sunpy/map/map_factory.py:297 │ └─ 0.360 _parse_args sunpy/map/map_factory.py:188 │ └─ 0.359 wrapper sunpy/util/functools.py:17 │ └─ 0.359 _parse_path sunpy/map/map_factory.py:279 │ └─ 0.359 _read_file sunpy/map/map_factory.py:151 │ └─ 0.359 read_file sunpy/io/file_tools.py:55 │ └─ 0.359 read sunpy/io/fits.py:43 │ ├─ 0.178 __get__ astropy/utils/decorators.py:752 │ │ └─ 0.178 data astropy/io/fits/hdu/image.py:219 │ │ └─ 0.178 _get_scaled_image_data astropy/io/fits/hdu/image.py:767 │ │ └─ 0.178 _get_raw_data astropy/io/fits/hdu/base.py:507 │ │ └─ 0.178 readarray astropy/io/fits/file.py:246 │ │ └─ 0.177 _array_from_file astropy/io/fits/util.py:563 │ │ └─ 0.171 read gzip.py:287 │ │ [9 frames hidden] gzip, _compression │ └─ 0.173 fitsopen astropy/io/fits/hdu/hdulist.py:35 │ └─ 0.173 fromfile astropy/io/fits/hdu/hdulist.py:389 │ └─ 0.173 _readfrom astropy/io/fits/hdu/hdulist.py:1040 │ └─ 0.172 _read_next_hdu astropy/io/fits/hdu/hdulist.py:1116 │ └─ 0.171 readfrom astropy/io/fits/hdu/base.py:296 │ └─ 0.171 seek astropy/io/fits/file.py:374 │ └─ 0.171 seek gzip.py:368 │ [11 frames hidden] gzip, _compression ├─ 0.320 sunpy/map/__init__.py:1 │ ├─ 0.236 sunpy/map/mapbase.py:1 │ └─ 0.082 sunpy/map/sources/__init__.py:1 │ └─ 0.081 sunpy/map/map_factory.py:1 │ └─ 0.080 sunpy/map/mapsequence.py:1 │ └─ 0.076 matplotlib/animation.py:19 │ [21 frames hidden] matplotlib, subprocess ├─ 0.134 aiapy/calibrate/__init__.py:1 │ [2 frames hidden] aiapy │ 0.134 aiapy/calibrate/meta.py:1 │ └─ 0.134 sunpy/net/__init__.py:3 └─ 0.126 sunpy/__init__.py:1 └─ 0.119 sunpy/tests/runner.py:1 └─ 0.119 astropy/__init__.py:2 └─ 0.096 _check_requirement astropy/__init__.py:77 └─ 0.094 numpy/__init__.py:1 [171 frames hidden] numpy, pickle, struct, re, pathlib, u... ```

The important bits here are:

The whole script took 7s

7.275 <module>  ../../bench_aia_simple.py:1

about 1s of this is imports, which while a valid target for optimisation I am going to ignore. Next up is actually loading the map:

├─ 0.360 __call__  sunpy/map/map_factory.py:297
│  └─ 0.360 _parse_args  sunpy/map/map_factory.py:188
│     └─ 0.359 wrapper  sunpy/util/functools.py:17
│        └─ 0.359 _parse_path  sunpy/map/map_factory.py:279
│           └─ 0.359 _read_file  sunpy/map/map_factory.py:151
│              └─ 0.359 read_file  sunpy/io/file_tools.py:55
│                 └─ 0.359 read  sunpy/io/fits.py:43
│                    ├─ 0.173 fitsopen  astropy/io/fits/hdu/hdulist.py:35
│                    │  └─ 0.173 fromfile  astropy/io/fits/hdu/hdulist.py:389
│                    │     └─ 0.173 _readfrom  astropy/io/fits/hdu/hdulist.py:1040
│                    │        └─ 0.172 _read_next_hdu  astropy/io/fits/hdu/hdulist.py:1116
│                    │           └─ 0.171 readfrom  astropy/io/fits/hdu/base.py:296
│                    │              └─ 0.171 seek  astropy/io/fits/file.py:374
│                    │                 └─ 0.171 seek  gzip.py:368
│                    │                       [83 frames hidden]  gzip, _compression
│                    └─ 0.178 __get__  astropy/utils/decorators.py:752
│                       └─ 0.178 data  astropy/io/fits/hdu/image.py:219
│                          └─ 0.178 _get_scaled_image_data  astropy/io/fits/hdu/image.py:767
│                             └─ 0.178 _get_raw_data  astropy/io/fits/hdu/base.py:507
│                                └─ 0.178 readarray  astropy/io/fits/file.py:246
│                                   └─ 0.177 _array_from_file  astropy/io/fits/util.py:563
│                                      └─ 0.171 read  gzip.py:287
│                                            [137 frames hidden]  gzip, _compression

Almost all of this (other than 0.001s) is loading the FITS file into memory, and implemented in astropy, so nothing for sunpy to worry about here.

Next is register, which is where things get interesting.

The total time is:

└─ 5.851 register  aiapy/calibrate/prep.py:14

which as pointed out is 63% the submap detection in register, which Jack has proposed an optimsation for already:

   ├─ 3.731 contains_full_disk  sunpy/map/maputils.py:127

Almost all the rest is in the call to GenericMap.rotate

   ├─ 1.981 wrapper  astropy/units/decorators.py:178
   │  └─ 1.981 rotate  sunpy/map/mapbase.py:1212

with 0.12 seconds being in the submap operation at the end of register:

   └─ 0.121 inner_f  sunpy/util/decorators.py:307
      └─ 0.121 wrapper  astropy/units/decorators.py:178
         └─ 0.121 submap  sunpy/map/mapbase.py:1397

So, having tackled the obvious low-hanging fruit of the contains_full_disk function we can see the next thing is the GenericMap.rotate method that we should look at.

To do this properly I am going to change our script to be this:

import sunpy.data.sample
import sunpy.map

from pyinstrument import Profiler

aia_map = sunpy.map.Map(sunpy.data.sample.AIA_171_ROLL_IMAGE)

profiler = Profiler()
profiler.start()

aia_map.rotate(order=3, recenter=True)

profiler.stop()
print(profiler.output_text(unicode=True, color=True))

which hopefully is a good approximation of the rotate call in register, at least good enough for benchmarking. Running this script gives:

1.951 <module>  bench_aia.py:1
└─ 1.936 wrapper  astropy/units/decorators.py:178
   └─ 1.936 rotate  sunpy/map/mapbase.py:1212
      ├─ 1.767 affine_transform  sunpy/image/transform.py:15
      │  ├─ 1.464 warp  skimage/transform/_warps.py:666
      │  │     [15 frames hidden]  skimage, numpy, <__array_function__ i...
      │  │        1.335 [self]  
      │  ├─ 0.126 <module>  skimage/transform/__init__.py:1
      │  │     [126 frames hidden]  skimage, scipy, inspect, numpy, enum,...
      │  ├─ 0.106 _showwarnmsg  warnings.py:96
      │  │  └─ 0.106 _showwarning  astropy/logger.py:170
      │  │     └─ 0.103 warning  logging/__init__.py:1436
      │  │           [frames hidden]  logging
      │  └─ 0.066 [self]  
      ├─ 0.059 reference_coordinate  sunpy/map/mapbase.py:884
      │  └─ 0.056 coordinate_frame  sunpy/map/mapbase.py:480
      │     └─ 0.049 wcs  sunpy/map/mapbase.py:419
      ├─ 0.054 world_to_pixel  sunpy/map/mapbase.py:1039
      │  ├─ 0.025 coordinate_frame  sunpy/map/mapbase.py:480
      │  │  └─ 0.022 wcs  sunpy/map/mapbase.py:419
      │  └─ 0.023 wcs  sunpy/map/mapbase.py:419
      └─ 0.052 pad  <__array_function__ internals>:2
            [6 frames hidden]  <__array_function__ internals>, numpy

The actual numerical chunk of the work here is completing in 1.661 seconds, out of the total 1.936 seconds. This tells me that this method has relatively little overhead in the actual sunpy code and that mostly it's the C layer code in scikit-image which is taking the compute time.

At this point the next step is to swap this out with another implementation of an affine transform (opencv for example) and see how they compare (both in terms of performance, but also accuracy and consistency with the existing implementation). I am not about to do that (my dinner just arrived as I type this) but I can tell you where to start to do it in an informative way. To compare how much faster opencv is you need to fork sunpy and modify the affine_transform wrapper to call opencv's warpAffine method rather than scikit-image's affine_transform. If you can construct a call which is equivalent, and the you repeat my benchmark you will get a directly comparible result.

A final note on implementation. I think that if the opencv or the cupy implementations are sufficently fast while maintaining the same behaviour, then we can look at adding a way to use them though Map.rotate. I think that the best way to do this would probably be to allow people to pass a function through which get's used to do the affine transform inside sunpy's sunpy.image.transform.affine_transform. The hardest part about this would be strictly describing and specifying the behaviour of that function so that you are likely to do it right and get comparible answers.

p.s. sorry for the essay :grinning:

In GitLab by @cbard on Sep 9, 2020, 13:00

@Cadair - Looks like you're getting similar timings as Raphael got for the original register routine, so that's good to see. I agree that an unified benchmarking approach other than %timeit is necessary, so thanks for pointing me to pyinstrument! I'll take a crack at adding the openCV warpAffine to sunpy.map.rotate and see what falls out.

In GitLab by @cbard on Sep 16, 2020, 14:37

@wtbarnes @Cadair Finally figured out the nuances of openCV.warpAffine vs. skimage.warp vs. scipy.ndimage.transform. I've added my changes to a local fork and run some benchmarking tests.

Basically, openCV.warpAffine is 3x faster than skimage.warp on my laptop.

So, I think it's worth trying to add this somehow to the upstream sunpy repo. However, I'm not sure which of a hard-coded addition of openCV (which I have done) or a more general "pass function to do affine_transform" (which Stuart has suggested) is more appropriate.

In GitLab by @wtbarnes on Sep 16, 2020, 15:56

Thanks @cbard! Would you mind creating an issue on the main sunpy repo regarding this?

I think that having a more general way to pass a function to perform the affine transform would probably be preferable as it is also more future proof.

In GitLab by @wtbarnes on Sep 28, 2020, 10:17

Given that this discussion has now moved entirely to this sunpy issue, I'm going to close this. Once those performance updates are included in sunpy, we can revisit this.

LM-SAL / aiapy

Possible speed ups for aiapy.calibrate.register() #63