aleju / imgaug

Image augmentation for machine learning experiments.
http://imgaug.readthedocs.io
MIT License

Keypoint data structures are too slow. #426

Open Erotemic opened 5 years ago

Erotemic commented 5 years ago

I've been noticing massive slowdowns as I enable more augmenters in my training scripts. There is a lot of Python overhead incurred in the general architecture of the library.

Looking into it, I found that keypoint augmentation is taking up most of the time. This is because each keypoint is augmented one at a time, even though all of the keypoints undergo the same transformation. Also, almost every operation on keypoints seems to cause a deep copy, which incurs the overhead of constructing new Python objects. This may be fine for <10 keypoints, but it really starts to add up.

Demo of Speed Issue

Here is an MWE demonstrating that keypoint augmentation takes the majority of the time in a CropAndPad augmenter.

import numpy as np
import imgaug.augmenters as iaa
import imgaug
import timerit

# Construct demo data
xydata = np.random.rand(1000, 2)
imdata = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
input_dims = imdata.shape[0:2]

# Build imgaug data structures
kps = [imgaug.Keypoint(x, y) for x, y in xydata]
kpoi = imgaug.KeypointsOnImage(kps, shape=tuple(input_dims))

# Build an augmenter
augmenter = iaa.CropAndPad(px=(0, 4))

ti = timerit.Timerit(100, bestof=10, verbose=2, unit='us')

for timer in ti.reset('time augment_image'):
    with timer:
        augmenter.augment_image(imdata)

for timer in ti.reset('time augment_keypoints'):
    with timer:
        augmenter.augment_keypoints(kpoi)

The output is:

Timed time augment_image for: 100 loops, best of 10
    time per loop: best=1367.766 µs, mean=1477.141 ± 54.5 µs

Timed time augment_keypoints for: 100 loops, best of 10
    time per loop: best=14879.146 µs, mean=15190.417 ± 263.6 µs

Augmenting the keypoints took 10x longer than augmenting an image!

Methods for Mitigation: numpy vectorization

This problem could be avoided if KeypointsOnImage had a better internal representation. Namely, instead of being a list of Keypoint objects, it could simply store a single 2D numpy array of the keypoint locations. (Note that this same optimization can be applied to bounding boxes.)
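
For reference, KeypointsOnImage already provides an array round trip via to_xy_array / from_xy_array (assuming the from_xy_array helper available in recent imgaug versions), which hints at what an array-backed representation buys. A minimal illustration, not a fix by itself:

import numpy as np
import imgaug

kpoi = imgaug.KeypointsOnImage(
    [imgaug.Keypoint(x=1.0, y=2.0), imgaug.Keypoint(x=3.0, y=4.0)],
    shape=(256, 256))

xy = kpoi.to_xy_array()       # one (N, 2) float32 array for all keypoints
xy += np.float32([[5, 10]])   # e.g. shift every keypoint in a single operation
kpoi2 = imgaug.KeypointsOnImage.from_xy_array(xy, shape=kpoi.shape)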

Even with the conversion overhead, I can demonstrate a roughly 8x speedup by simply modifying the _crop_and_pad_kpsoi function: instead of using the internal KeypointsOnImage methods, I convert to a numpy array, perform all operations there, and then convert back.

from imgaug.augmenters.size import _compute_shape_after_crop_and_pad
from imgaug.augmentables.utils import normalize_shape, project_coords
from imgaug.augmenters.size import _crop_and_pad_kpsoi

def _crop_and_pad_kpsoi_vectorized(kpoi, croppings_img, paddings_img, keep_size):
    # Net x/y shift for every keypoint: padding adds, cropping subtracts
    crop_tr_y, _, _, crop_bl_x = croppings_img
    pad_tr_y, _, _, pad_bl_x = paddings_img
    x = pad_bl_x - crop_bl_x
    y = pad_tr_y - crop_tr_y

    # Convert List[Keypoint] to a single (N, 2) array and shift all points at once
    xy = np.array([[kp.x, kp.y] for kp in kpoi.keypoints], dtype=np.float32)
    offset = np.array([[x, y]])

    xy_aug = xy + offset

    shifted_shape = _compute_shape_after_crop_and_pad(
            kpoi.shape, croppings_img, paddings_img)

    if keep_size:
        proj_shape = normalize_shape(kpoi.shape)
        xy_aug = project_coords(xy_aug, shifted_shape, proj_shape)
        shifted_shape = proj_shape

    # Convert back to the list-of-Keypoint representation
    kps = [imgaug.Keypoint(x, y) for x, y in xy_aug]
    shifted = imgaug.KeypointsOnImage(kps, shape=shifted_shape)
    return shifted

I time the old version versus the new one and assert that they produce the same outputs.

ti = timerit.Timerit(100, bestof=10, verbose=2, unit='us')
paddings_img = (0, 3, 0, 2)
croppings_img = (0, 0, 0, 0)
keep_size = True

for timer in ti.reset('old _crop_and_pad_kpsoi2'):
    with timer:
        old = _crop_and_pad_kpsoi(kpoi, croppings_img, paddings_img, keep_size)

for timer in ti.reset('new _crop_and_pad_kpsoi_vectorized'):
    with timer:
        new = _crop_and_pad_kpsoi_vectorized(kpoi, croppings_img, paddings_img, keep_size)

assert old.shape == new.shape
for o, n in zip(old.keypoints, new.keypoints):
    assert np.isclose(o.x, n.x)
    assert np.isclose(o.y, n.y)

This gives us roughly an 8x speedup!

Timed old _crop_and_pad_kpsoi2 for: 100 loops, best of 10
    time per loop: best=13239.054 µs, mean=16269.084 ± 1999.8 µs
Timed new _crop_and_pad_kpsoi_vectorized for: 100 loops, best of 10
    time per loop: best=2003.635 µs, mean=2061.045 ± 50.7 µs

You might note that this example used 1000 keypoints, and you might expect the vectorized solution to be slower for an item with only a few keypoints. It turns out this is untrue. Even for 2 keypoints the vectorized solution is faster (33 µs vs. 44 µs for the old implementation), and for a single keypoint the times are effectively the same.

Conversions are costly!

And note that we can do MUCH better than the above implementation. If we look at the line-profiler results, we see that the majority of the remaining overhead is simply the conversion from the List[Keypoint] backend to a numpy array and back:

Function: _crop_and_pad_kpsoi_vectorized at line 5

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     5                                           def _crop_and_pad_kpsoi_vectorized(kpoi, croppings_img, paddings_img, keep_size):
     6         1          3.0      3.0      0.1      crop_tr_y, _, _, crop_bl_x = croppings_img
     7         1          2.0      2.0      0.0      pad_tr_y, _, _, pad_bl_x = paddings_img
     8         1          1.0      1.0      0.0      x = pad_bl_x - crop_bl_x
     9         1          1.0      1.0      0.0      y = pad_tr_y - crop_tr_y
    10                                           
    11         1       1604.0   1604.0     30.6      xy = np.array([[kp.x, kp.y] for kp in kpoi.keypoints], dtype=np.float32)
    12         1         11.0     11.0      0.2      offset = np.array([[x, y]])
    13                                           
    14         1         73.0     73.0      1.4      xy_aug = xy + offset
    15                                           
    16         1          2.0      2.0      0.0      shifted_shape = _compute_shape_after_crop_and_pad(
    17         1         17.0     17.0      0.3              kpoi.shape, croppings_img, paddings_img)
    18                                           
    19         1          1.0      1.0      0.0      if keep_size:
    20         1          5.0      5.0      0.1          proj_shape = normalize_shape(kpoi.shape)
    21         1         90.0     90.0      1.7          xy_aug = project_coords(xy_aug, shifted_shape, proj_shape)
    22         1          1.0      1.0      0.0          shifted_shape = proj_shape
    23                                           
    24         1       3417.0   3417.0     65.2      kps = [imgaug.Keypoint(x, y) for x, y in xy_aug]
    25         1         10.0     10.0      0.2      shifted = imgaug.KeypointsOnImage(kps, shape=shifted_shape)
    26         1          1.0      1.0      0.0      return shifted

This means that if imgaug improved its backend it wouldn't need to convert between formats as many times, and it would see massive speedups.

Proposal for Speedup: Vectorized objects with numpy backends

Ultimately, I think imgaug would benefit from moving to data structure primitives with vectorized backends (like numpy).

It may seem like this would come at the cost of code re-use, but that is not necessarily the case (although it would mean a codebase overhaul). Currently, single-item objects like Keypoint/Box/etc. are the data structure primitives, and XsOnImage reuses code by iterating over the single-item objects. However, if you were to use multi-item objects as the primitives (with numpy backends, of course; here is an example of how I implemented something similar with boxes), then you could implement the single-item objects as special cases of the multi-item objects, as the toy example below shows.
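
One way to read "single-item objects as special cases of the multi-item objects" (a toy illustration with assumed names, not imgaug code):

import numpy as np

class Points:
    # Multi-item primitive: all real logic operates on one (N, 2) array
    def __init__(self, xy):
        self.xy = np.asarray(xy, dtype=np.float32)

    def shift(self, x=0, y=0):
        # Vectorized core shared by single- and multi-item objects
        return Points(self.xy + np.float32([[x, y]]))

class Point(Points):
    # Single-item object as a thin special case: a (1, 2) array
    def __init__(self, x, y):
        super().__init__([[x, y]])

    @property
    def x(self):
        return float(self.xy[0, 0])

    @property
    def y(self):
        return float(self.xy[0, 1])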

That being said, that's probably too big of a change to ask for. A more realistic goal would be to implement a Keypoints object (sketched below) that stores multiple keypoints as a numpy array but exposes __getitem__ to return a Keypoint object, so the existing KeypointsOnImage logic would work as-is. Incremental improvements could then be made to KeypointsOnImage: whenever it sees that isinstance(self.keypoints, Keypoints) is True, it can use the faster methods of the Keypoints object instead of looping over each individual item and incurring so much Python overhead.

For instance, KeypointsOnImage.shift might look like this:

    def shift(self, x=0, y=0):
        if isinstance(self.keypoints, Keypoints): 
            keypoints = self.keypoints.shift(x=x, y=y)
        else:
            keypoints = [keypoint.shift(x=x, y=y) for keypoint in self.keypoints]
        return self.deepcopy(keypoints)

and Keypoints might have a shift method that looks like this:

    def shift(self, x=0, y=0):
        # One vectorized addition replaces N per-object shift() calls
        offset = np.array([[x, y]])
        return self.deepcopy(self._data + offset)

Implementing it this way would allow for a gradual shift to the new vectorized style and would provide immediate speedups.
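
For concreteness, here is a minimal sketch of what such a Keypoints container could look like (hypothetical: the class, its _data attribute, and its methods follow the proposal above and are not existing imgaug API):

import numpy as np
import imgaug


class Keypoints:
    # Stores N keypoints as one (N, 2) float32 array
    def __init__(self, xy):
        self._data = np.asarray(xy, dtype=np.float32)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index):
        # Return a single-item Keypoint so existing per-item logic still works
        x, y = self._data[index]
        return imgaug.Keypoint(x=float(x), y=float(y))

    def __iter__(self):
        return (self[i] for i in range(len(self)))

    def deepcopy(self, xy):
        return Keypoints(xy)

    def shift(self, x=0, y=0):
        # One vectorized addition instead of N per-object copies
        return self.deepcopy(self._data + np.float32([[x, y]]))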

Erotemic commented 5 years ago

I actually was able to put together a POC pretty quickly, and it gives a 77x speedup.

import numpy as np
import imgaug.augmenters as iaa
import imgaug
import timerit

# Construct demo data
xydata = np.random.rand(1000, 2)
imdata = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
input_dims = imdata.shape[0:2]

# Build imgaug data structures
kps = [imgaug.Keypoint(x, y) for x, y in xydata]
kpoi = imgaug.KeypointsOnImage(kps, shape=tuple(input_dims))

# Build an augmenter
augmenter = iaa.CropAndPad(px=(0, 4))

ti = timerit.Timerit(100, bestof=10, verbose=2, unit='us')

for timer in ti.reset('time augment_image'):
    with timer:
        augmenter.augment_image(imdata)

for timer in ti.reset('time augment_keypoints'):
    with timer:
        augmenter.augment_keypoints(kpoi)

# Swap in the new array-backed Keypoints container from the POC
kpoi.keypoints = imgaug.augmentables.Keypoints(kpoi.to_xy_array())
for timer in ti.reset('time augment newstyle keypoints'):
    with timer:
        augmenter.augment_keypoints(kpoi)

Results:

Timed time augment_image for: 100 loops, best of 10
    time per loop: best=908.287 µs, mean=916.182 ± 7.5 µs
Timed time augment_keypoints for: 100 loops, best of 10
    time per loop: best=14612.450 µs, mean=15076.556 ± 260.6 µs
Timed time augment newstyle keypoints for: 100 loops, best of 10
    time per loop: best=189.174 µs, mean=195.178 ± 3.2 µs

I will submit a PR with those modifications and then we can decide where to go from there.

aleju commented 5 years ago

I see the point of using vectorized implementations. So far I have avoided changing the codebase in that direction, as it is likely a lot of work and the keypoint augmentation felt fast enough to me. There is also the problem that users may want to associate information with each individual keypoint, such as a label ("left eye", "right hand", ...) or an identifier (e.g. a filename). That can become a significant headache to implement with an array-based approach if the number or order of keypoints changes during augmentation.

Regarding the approach that you proposed: as far as I can tell, it has the advantage of limiting the changes to the codebase while still gaining the speed increases. It has the disadvantage of introducing a lot of isinstance checks and adding a class that "masquerades" as another class, which increases the code complexity and will potentially make it harder to reason about what the code does and where errors are coming from. In its current form it also has the disadvantage that the user explicitly has to create Keypoints instances to get the speed advantage.

An alternative (but fairly similar) way forward would be to introduce a class parallel to KeypointsOnImage, e.g. KeypointsArrayOnImage (which could initially be marked as private until it is well established). That class would contain .keypoints as an array -- and only an array -- and reimplement all methods of KeypointsOnImage, similar to your array-based implementations. Upon calling augment_keypoints(), each KeypointsOnImage instance would be changed to a KeypointsArrayOnImage instance (unless it already is one) and the augmentation would proceed with the array-based class. Each augmenter would then expect that class and not KeypointsOnImage. After the augmentation is finished, the KeypointsArrayOnImage instances would be transformed back to KeypointsOnImage instances.

This has the advantage of having fairly little impact on the codebase, being easy to reason about, and giving the speed boost to all users. It also avoids potentially having to duplicate tests for augmenters -- they would always just test the array-based approach. For the future it would give a path towards promoting KeypointsArrayOnImage to the standard class for keypoints.

The approach has the disadvantage of still requiring the transformation from KeypointsOnImage to KeypointsArrayOnImage, which isn't very fast. This doesn't necessarily have to be a significant downside. The cost of the operation would be amortized to a degree when using multiple augmenters per batch, as the transformation only has to be done once (though that is only relevant if there is more than one augmenter in the pipeline that affects spatial pixel locations, e.g. crop + horizontal flips + affine). If a user calls augment() with keypoints given as a numpy array, the whole transformation could also be skipped. It also still has the downside of making it hard to deal with changes to the order of keypoints, or with situations where keypoints are added or deleted. Though if the class also contained a list of keypoint indices, that might already be enough to deal with the issue (e.g. reverse the indices list if the order of the keypoints is reversed).
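
A rough sketch of that wrap/augment/unwrap flow (all names here are hypothetical, following the proposal above rather than any existing imgaug API):

import numpy as np
import imgaug


class KeypointsArrayOnImage:
    # Hypothetical parallel class: coordinates live in one (N, 2) array
    def __init__(self, xy, shape):
        self.keypoints = np.asarray(xy, dtype=np.float32)
        self.shape = shape

    @classmethod
    def from_keypoints_on_image(cls, kpsoi):
        # Wrap before augmentation
        return cls(kpsoi.to_xy_array(), kpsoi.shape)

    def to_keypoints_on_image(self):
        # Unwrap after augmentation
        return imgaug.KeypointsOnImage.from_xy_array(
            self.keypoints, shape=self.shape)

    def shift(self, x=0, y=0):
        # Vectorized counterpart of KeypointsOnImage.shift()
        return KeypointsArrayOnImage(
            self.keypoints + np.float32([[x, y]]), self.shape)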

I also wonder if there is a way to make the transformation from Keypoint objects to a numpy array faster. If, instead of using xy = np.array([[kp.x, kp.y] for kp in kpoi.keypoints], dtype=np.float32), one would first create an uninitialized array via np.empty((N, 2), dtype=np.float32) and then write the xy-coordinates into that array via a nested Cython loop, maybe that would end up being faster.
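
In pure Python the preallocation variant would look like the sketch below; the inner loop is what a Cython version would compile down to C (an illustration of the idea, not a benchmarked claim):

import numpy as np

def keypoints_to_xy_array(keypoints):
    # Preallocate once, then fill in place instead of building nested lists
    xy = np.empty((len(keypoints), 2), dtype=np.float32)
    for i, kp in enumerate(keypoints):
        xy[i, 0] = kp.x
        xy[i, 1] = kp.y
    return xy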

harpone commented 4 years ago

Just an FYI: I'm experiencing similar slowdowns, with keypoint transforms taking about 5x longer than heavy image augmentations...

I was a bit horrified to find out that the keypoint augmentations are done inside a list comprehension, one point at a time :D This is definitely a big blocker for me.

Thanks @Erotemic for the vectorized implementation! I'll give that a try.

IMO all operations should be vectorized by default, and having extra info per keypoint should be the exception. Then again, using numpy string arrays for that should be pretty fast too (e.g. casting to np.string_, which gives a fixed-width byte array and is therefore very fast and doesn't use much RAM).
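
For instance, coordinates and per-keypoint labels could live in two parallel arrays, kept aligned by indexing with the same index array (a sketch of the idea, not existing imgaug behaviour):

import numpy as np

xy = np.float32([[10.5, 20.0], [30.0, 40.5]])    # (N, 2) coordinates
labels = np.array([b"left eye", b"right hand"])  # fixed-width bytes, dtype '|S10'

keep = np.array([1, 0])              # reorder (or filter) keypoints...
xy, labels = xy[keep], labels[keep]  # ...and their labels stay aligned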

Other than this issue, I really like imgaug. Thanks @aleju for a great library!

Schobs commented 2 years ago

@harpone Did you ever solve this? Did you manage to speed up keypoint augmentations, or find another library? Albumentations is OK, but they don't offer most augmentations for keypoints, e.g. elastic deformations.

harpone commented 2 years ago

Using albumentations most of the time nowadays, and haven't had much need for keypoint augs since.

Schobs commented 2 years ago

But albumentations doesn't support elastic deformations for keypoints. Unless I fork/branch and add that feature myself, I think I'm stuck with imgaug.
