Erotemic opened this issue 5 years ago
I actually was able to make a POC pretty quickly and I got a 77x speedup.
```python
import numpy as np
import imgaug.augmenters as iaa
import imgaug
import timerit

# Construct demo data: 1000 keypoints on a 256x256 uint8 image
xydata = np.random.rand(1000, 2)
imdata = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
input_dims = imdata.shape[0:2]

# Build imgaug data structures
kps = [imgaug.Keypoint(x, y) for x, y in xydata]
kpoi = imgaug.KeypointsOnImage(kps, shape=tuple(input_dims))

# Build an augmenter
augmenter = iaa.CropAndPad(px=(0, 4))

ti = timerit.Timerit(100, bestof=10, verbose=2, unit='us')
for timer in ti.reset('time augment_image'):
    with timer:
        augmenter.augment_image(imdata)

for timer in ti.reset('time augment_keypoints'):
    with timer:
        augmenter.augment_keypoints(kpoi)

# Swap the list-of-Keypoint backend for the array-backed Keypoints class
# proposed in this issue (this class only exists with the POC patch applied)
kpoi.keypoints = imgaug.augmentables.Keypoints(kpoi.to_xy_array())
for timer in ti.reset('time augment newstyle keypoints'):
    with timer:
        augmenter.augment_keypoints(kpoi)
```
Results:

```
Timed time augment_image for: 100 loops, best of 10
    time per loop: best=908.287 µs, mean=916.182 ± 7.5 µs
Timed time augment_keypoints for: 100 loops, best of 10
    time per loop: best=14612.450 µs, mean=15076.556 ± 260.6 µs
Timed time augment newstyle keypoints for: 100 loops, best of 10
    time per loop: best=189.174 µs, mean=195.178 ± 3.2 µs
```
I will submit a PR with those modifications and then we can decide where to go from there.
I see the point of using vectorized implementations. So far I have avoided changing the codebase in that direction, as it is likely a lot of work and keypoint augmentation felt fast enough to me. There is also the problem that users may want to associate information with each individual keypoint, such as a label ("left eye", "right hand", ...) or an identifier (e.g. a filename). That can become a significant headache to implement with an array-based approach if the number or order of keypoints changes during augmentation.
Regarding the approach that you proposed: As far as I can tell, it has the advantage of limiting the changes to the codebase while still gaining the speed increases. It has the disadvantage of inducing a lot of `isinstance` statements and adding a class that "masquerades" as another class, which increases the code complexity and will potentially make it harder to reason about what the code does and where errors are coming from. In its current form it also has the disadvantage that the user explicitly has to create `Keypoints` instances to get the speed advantage.
An alternative (but fairly similar) way forward would be to introduce a parallel class to `KeypointsOnImage`, e.g. `KeypointsArrayOnImage` (which could initially be marked as private until it is well established). That class would contain `.keypoints` as an array -- and only an array -- and would reimplement all methods of `KeypointsOnImage`, similar to your array-based implementations. Then, upon calling `augment_keypoints()`, each `KeypointsOnImage` instance is changed to a `KeypointsArrayOnImage` instance (unless it already is one) and the augmentation proceeds with the array-based class. Each augmenter would then expect that class and not `KeypointsOnImage`. After the augmentation is finished, the `KeypointsArrayOnImage` instances are transformed back to `KeypointsOnImage` instances. This has the advantage of having fairly little impact on the codebase, being easy to reason about, and giving the speed boost to all users. It also avoids potentially having to duplicate tests for augmenters -- they would always just test the array-based approach. For the future, it would give a path towards promoting `KeypointsArrayOnImage` to the standard class for keypoints.
The approach has the disadvantage of still requiring the transformation from `KeypointsOnImage` to `KeypointsArrayOnImage`, which isn't very fast. This doesn't necessarily have to be a significant downside. The cost of the operation would be amortized to a degree when using multiple augmenters per batch, as the transformation only has to be done once (though that is only relevant if there is more than one augmenter in the pipeline that affects spatial pixel locations, e.g. crop + horizontal flips + affine). If a user calls `augment()` with keypoints given as a numpy array, the whole transformation could also be skipped.
It also still has the downside of making it hard to deal with changes to the order of keypoints, or with situations where keypoints are added or deleted. Though if the class also contained a list of keypoint indices, that might already be enough to deal with the issue (e.g. reverse the indices list if the order of the keypoints is reversed).
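For illustration, here is a minimal sketch of what such an index-carrying class could look like. Only the `.keypoints`-as-array idea, the class name, and the index list come from the discussion above; the constructor and method set are made up for the example:

```python
import numpy as np

class KeypointsArrayOnImage:
    # Sketch only: .keypoints is a single (N, 2) float32 array of xy
    # coordinates, plus an index list that tracks reordering/removal.
    def __init__(self, keypoints, shape, indices=None):
        self.keypoints = np.asarray(keypoints, dtype=np.float32)
        self.shape = shape
        self.indices = (list(range(len(self.keypoints)))
                        if indices is None else indices)

    def shift(self, x=0, y=0):
        # One vectorized addition instead of a per-Keypoint loop.
        offset = np.array([x, y], dtype=self.keypoints.dtype)
        return KeypointsArrayOnImage(self.keypoints + offset,
                                     self.shape, self.indices)

    def reverse(self):
        # Keeping the index list in sync with a reordering, as suggested above.
        return KeypointsArrayOnImage(self.keypoints[::-1],
                                     self.shape, self.indices[::-1])
```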
I also wonder if there is a way to make the transformation from `Keypoint` objects to a numpy array faster. If, instead of using `xy = np.array([[kp.x, kp.y] for kp in kpoi.keypoints], dtype=np.float32)`, one would first create an uninitialized array via `np.empty((N, 2), dtype=np.float32)` and then write the xy-coordinates into that array via a nested cython loop, maybe that would end up being faster.
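As a pure-Python sketch of that idea (the fill loop is what a Cython version would compile; in plain Python it is unlikely to beat the list comprehension):

```python
import numpy as np

def to_xy_array_listcomp(keypoints):
    # Current approach: build a nested list, then copy it into an array.
    return np.array([[kp.x, kp.y] for kp in keypoints], dtype=np.float32)

def to_xy_array_prealloc(keypoints):
    # Alternative: allocate an uninitialized array once and fill it in place.
    xy = np.empty((len(keypoints), 2), dtype=np.float32)
    for i, kp in enumerate(keypoints):
        xy[i, 0] = kp.x
        xy[i, 1] = kp.y
    return xy
```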
Just an FYI: I'm experiencing similar slowdowns, with keypoint transforms taking about 5x longer than heavy image augmentations...
I was a bit horrified to find out that the keypoint augmentations are done inside a list comprehension for each point :D This is definitely a big blocker for me.
Thanks @Erotemic for the vectorized implementation! I'll give that a try.
IMO all operations should be vectorized by default, and having extra info per keypoint should be the exception. Then again, using numpy string arrays should be pretty fast too (e.g. casting to `np.string_`, which gives a rectangular fixed-width byte array and is therefore very fast and doesn't use much RAM).
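A small example of that idea, with hypothetical per-keypoint labels (`np.bytes_` is the current name for the old `np.string_` alias):

```python
import numpy as np

# Labels stored as a fixed-width byte array instead of a Python list;
# every entry is padded to the longest label, so the whole thing is one
# contiguous block of memory.
labels = np.array([b"left eye", b"right hand", b"nose"], dtype=np.bytes_)
print(labels.dtype)    # dtype('S10') -- fixed-width byte strings
# Labels reorder with the same fancy indexing used for the coordinates.
print(labels[[2, 0]])  # [b'nose' b'left eye']
```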
Other than this issue, I really like imgaug. Thanks @aleju for a great library!
@harpone Did you ever solve this? Did you manage to speed up keypoint augmentations, or find another library? Albumentations is ok, but they don't offer most augmentations for keypoints, e.g. elastic.
using albumentations nowadays most of the time, and haven't had much need for keypoint augs since.
But albumentations doesn't support elastic deformations for keypoints. Unless I fork/branch and add that feature myself, I think I'm stuck with imgaug.
I've been noticing massive slowdowns as I enable more augmenters in my training scripts. There is a lot of Python overhead incurred in the general architecture of the library.
Looking into it, I found that keypoint augmentation takes up most of the time. This is because each keypoint is augmented one at a time, even though all of the keypoints undergo the same transformation. Also, nearly every operation on keypoints seems to cause a deep copy, which incurs the overhead of constructing new Python objects. This may be fine for <10 keypoints, but it really starts to add up.
## Demo of Speed Issue
Here is a MWE demonstrating that keypoint augmentation takes the majority of the time in a `CropAndPad` augmenter.
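A minimal version, reconstructed along the lines of the timing setup from the POC earlier in this thread:

```python
import numpy as np
import imgaug
import imgaug.augmenters as iaa
import timerit

# 1000 keypoints on a 256x256 image
xydata = np.random.rand(1000, 2)
imdata = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)
kps = [imgaug.Keypoint(x, y) for x, y in xydata]
kpoi = imgaug.KeypointsOnImage(kps, shape=imdata.shape[0:2])

augmenter = iaa.CropAndPad(px=(0, 4))

# Time image augmentation vs keypoint augmentation for the same augmenter
ti = timerit.Timerit(100, bestof=10, verbose=2, unit='us')
for timer in ti.reset('time augment_image'):
    with timer:
        augmenter.augment_image(imdata)
for timer in ti.reset('time augment_keypoints'):
    with timer:
        augmenter.augment_keypoints(kpoi)
```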
The result: augmenting the keypoints took 10x longer than augmenting an image!
## Methods for Mitigation: numpy vectorization
This problem could be avoided if `KeypointsOnImage` had a better internal representation. Namely, instead of being a list of `Keypoint` objects, it could simply store a single 2D numpy array of the keypoint locations. (Note that the same optimization can be applied to bounding boxes.)
Even with the conversion overhead, I can demonstrate a 10x speedup by simply modifying the `_crop_and_pad_kpsoi` function. Instead of using the internal `KeypointsOnImage` methods, I convert to a numpy array, perform all operations, and then convert back. I timed the old version against the new and asserted that they produce the same outputs.
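A sketch of the idea with a simplified signature -- the real internal `_crop_and_pad_kpsoi` takes different arguments, so the `croppings`/`paddings` tuples here are assumptions for illustration, while `to_xy_array()` and `from_xy_array()` are actual `KeypointsOnImage` helpers:

```python
import numpy as np

def _crop_and_pad_kpsoi_vectorized(kpsoi, croppings, paddings):
    # croppings / paddings: (top, right, bottom, left) pixel amounts
    # (hypothetical argument layout for this sketch).
    top_c, right_c, bottom_c, left_c = croppings
    top_p, right_p, bottom_p, left_p = paddings
    # Convert List[Keypoint] -> (N, 2) array once ...
    xy = kpsoi.to_xy_array()
    # ... shift every coordinate with a single broadcasted addition ...
    xy = xy + np.array([left_p - left_c, top_p - top_c], dtype=xy.dtype)
    # ... and convert back once, using the new post-crop/pad image shape.
    height = kpsoi.shape[0] - top_c - bottom_c + top_p + bottom_p
    width = kpsoi.shape[1] - left_c - right_c + left_p + right_p
    return kpsoi.from_xy_array(xy, shape=(height, width) + kpsoi.shape[2:])
```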
This gives us a 10x speedup!
You might note that this example was for 1000 keypoints, and you might think the vectorized solution would be slower for an item with only a few keypoints. It turns out this is untrue. For even 2 keypoints the vectorized solution is faster (44us vs 33us), and for a single keypoint the times are effectively the same.
## Conversions are costly!
And note that we can do MUCH better than the above implementation. If we take a look at the line-profile results, we see that the majority of the overhead is simply in doing the conversion from the `List[Keypoint]` backend to a numpy one. This means that if imgaug improved its backend, it wouldn't need to convert between formats as many times, and it would see massive speedups.
## Proposal for Speedup: Vectorized objects with numpy backends
Ultimately, I think imgaug would benefit from moving to data structure primitives with vectorized backends (like numpy).
It may seem like this would come at the cost of code reuse, but that is not necessarily the case (although it would mean a codebase overhaul). Currently, single-item objects like `Keypoint`/`Box`/etc. are the data-structure primitives, and the `XsOnImage` classes reuse code by iterating over the single-item objects. However, if you were to use multi-item objects as the primitives (with numpy backends, of course; I have implemented something similar with boxes elsewhere), then you could implement the single-item objects as special cases of the multi-item objects.
That being said, that's probably too big of a change to ask for. A more realistic goal would be to implement a `Keypoints` object that stores multiple keypoints as a numpy array but exposes a `__getitem__` that returns a `Keypoint` object, so existing `KeypointsOnImage` logic would work as-is. Incremental improvements could then be made to `KeypointsOnImage` such that when it sees that `isinstance(self.keypoints, Keypoints)` is True, it uses the faster methods of the `Keypoints` object instead of looping over each individual item and incurring so much Python overhead. For instance, `KeypointsOnImage.shift` and a corresponding `Keypoints.shift` method might look like the sketch below. Implementing it this way would allow for a gradual shift to the new vectorized style and provide immediate speedups.
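A sketch of what those two methods could look like. The `Keypoints` class here is the proposed object, not an existing imgaug API, and `deepcopy(keypoints=...)` is assumed to behave like the existing `KeypointsOnImage.deepcopy` helper:

```python
import numpy as np
from imgaug.augmentables.kps import Keypoint

class Keypoints:
    """Proposed multi-keypoint object backed by a single (N, 2) numpy array."""

    def __init__(self, xy):
        self.xy = np.asarray(xy, dtype=np.float32)

    def __getitem__(self, index):
        # Return a plain Keypoint so existing per-item logic keeps working.
        x, y = self.xy[index]
        return Keypoint(x=x, y=y)

    def shift(self, x=0, y=0):
        # One vectorized addition covers every keypoint at once.
        return Keypoints(self.xy + np.array([x, y], dtype=self.xy.dtype))

# What KeypointsOnImage.shift could then look like:
def shift(self, x=0, y=0):
    if isinstance(self.keypoints, Keypoints):
        # Fast path: delegate to the vectorized implementation.
        keypoints = self.keypoints.shift(x=x, y=y)
    else:
        # Old path: loop over the individual Keypoint objects.
        keypoints = [kp.shift(x=x, y=y) for kp in self.keypoints]
    return self.deepcopy(keypoints=keypoints)
```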