Use LoopBuilder to multithread Modern plugins

mdoube commented 4 years ago

Describe the bug LoopBuilder is the ImgLib2 way to achieve chunked multithreading, see this for an example (specifically, the 'squared sum' example) https://forum.image.sc/t/imglib2-split-image-into-chunks-for-multi-threaded-processing/37519/4

Modern Connectivity currently uses 8 cursors to achieve multithreading, which is a bit of a hack.

https://github.com/imagej/imagej-ops/commit/6a60315419fad3745a456f9cc2719e7897d7b762#diff-d6c590f31ede52f4518dba0b4a44cf9fR261

                    final Cursor<B> octantCursor1 = Views.flatIterable(interval).cursor();
                    final Cursor<B> octantCursor2 = Views.flatIterable(interval).cursor();
                    final Cursor<B> octantCursor3 = Views.flatIterable(interval).cursor();
                    final Cursor<B> octantCursor4 = Views.flatIterable(interval).cursor();
                    final Cursor<B> octantCursor5 = Views.flatIterable(interval).cursor();
                    final Cursor<B> octantCursor6 = Views.flatIterable(interval).cursor();
                    final Cursor<B> octantCursor7 = Views.flatIterable(interval).cursor();
                    final Cursor<B> octantCursor8 = Views.flatIterable(interval).cursor();

                    octantCursor1.jumpFwd(start);
                    octantCursor2.jumpFwd(start + w);
                    octantCursor3.jumpFwd(start + 1);
                    octantCursor4.jumpFwd(start + w + 1);
                    octantCursor5.jumpFwd(start + w * h);
                    octantCursor6.jumpFwd(start + w * h + w);
                    octantCursor7.jumpFwd(start + w * h + 1);
                    octantCursor8.jumpFwd(start + w * h + w + 1);

                    for (int i = 0; i < steps; i++) {
                        boolean o1 = octantCursor1.next().get();
                        boolean o2 = octantCursor2.next().get();
                        boolean o3 = octantCursor3.next().get();
                        boolean o4 = octantCursor4.next().get();
                        boolean o5 = octantCursor5.next().get();
                        boolean o6 = octantCursor6.next().get();
                        boolean o7 = octantCursor7.next().get();
                        boolean o8 = octantCursor8.next().get();

Additional context Any change to the code needs to be performance benchmarked and reported to @maarzt If it works out, we might be able to improve the performance of other Modern plugins like VolumeFraction

rimadoma commented 4 years ago

The cursors do avoid random access, which slow the op down (I've tested). Specifically the random access needed to read the 8-neighbourhood of a pixel. My bet is that the inner lambda function of LoopBuilder.forEachPixel(T) (or Chunk) will have to avoid it too to achieve performance similar to the current op.

rimadoma commented 4 years ago

If it works out, we might be able to improve the performance of other Modern plugins like VolumeFraction

Sure, worth having a go. ATM ElementFractionWrapper just calls opService.stats().sum(interval) to count foreground voxels, and Interval.size() for total voxels. The former works, because in a BitType image foreground is always 1.

maarzt commented 4 years ago

LoopBuilder won't work for this use case. It supports a maximum number of 6 cursors or random accesses. You could work around this be using Views.collapse(Views.stack(...)) but I wouldn't expect any performance improvements by this.

I would choose a different approach:

You should copy the BitType input image into a UnsignedByteType image. (Using values 0 and 1).
Convolve the UnsignedByteType image with a kernel [1 2] in x, with a kernel [1 4] in you, and with a kernel [1 16] in z. This will encode your 8-neighborhood in the pixels of the UnsignedByteType.
Use a look up table byte[256] to make the encoded 8-neighborhood to the delta-euler-value.

mdoube commented 4 years ago

@maarzt I just did a little performance testing comparing the legacy and modern Connectivity plugins. You can see that even with some hacky multithreading, and avoiding random accesses with final Cursor<B> octantCursor1 = Views.flatIterable(interval).cursor(); the Modern approach is still way slower than the legacy approach. Is there a faster way to access pixels sequentially in an Op? especially pixel neighbourhoods, which is a standard kind of thing to do when processing with a kernel.

Performance timings

Timing Legacy Connectivity
Legacy connectivity took 1111 ms
Legacy connectivity took 876 ms
Legacy connectivity took 873 ms
Legacy connectivity took 1107 ms
Legacy connectivity took 909 ms
Timing Modern Connectivity
Modern connectivity took 14352 ms
Modern connectivity took 14298 ms
Modern connectivity took 14079 ms
Modern connectivity took 14071 ms
Modern connectivity took 14141 ms

Macro code:

print("Timing Legacy Connectivity");
for (i = 0; i < 5; i++){
   start = getTime();
   run("Connectivity");
   end = getTime();
   print("Legacy connectivity took "+end - start+" ms");
}

print("Timing Modern Connectivity");
for (i = 0; i < 5; i++){
   start = getTime();
   run("Connectivity (Modern)", "inputimage=net.imagej.ImgPlus@73672ce");
   end = getTime();
   print("Modern connectivity took "+end - start+" ms");
}

Test image umzc_378p_Apteryx_haastii_head.tif.zip

bonej-org / BoneJ2

Use LoopBuilder to multithread Modern plugins #252