SixLabors / ImageSharp

:camera: A modern, cross-platform, 2D Graphics library for .NET
https://sixlabors.com/products/imagesharp/

Discussion - Performance issues, and topics for review #130

Closed JimBobSquarePants closed 4 years ago

JimBobSquarePants commented 7 years ago

Originally on https://gist.github.com/antonfirsov/5ae8f3db1c455baec388b84d52f22e5f

ImageSharp - performance issues, and topics for review

1. Already discussed, with consensus:

Parallel.For()

Common Vector4 pixel type

Resize performance

Jpeg

2. New, potentially controversial topics

Internal API

Convolution processors

OctreeQuantizer

JimBobSquarePants commented 7 years ago

Thanks for raising all these points @antonfirsov . I've commented below.

1. Already discussed, with consensus:

Parallel.For()

  • Benchmark, and define best practices based on benchmark results
  • Apply the best practices (Remove Parallel.For-s, change granularity, etc.)

Agree on both. I want the library to gain as much as possible from being used in parallel. These benchmarks leave us wanting.

Common Vector4 pixel type

  • Implement with bulk operations (Especially the To/From Vector4 conversions ;)

Looking forward to adding this. We need to decide on the range for the properties though. Ideally, for maximum performance, we wouldn't scale and would only clamp when using ToBytes, but I'm assuming most consumers would expect any vectorized properties to range from 0 to 1. Maybe we can default to not scaling but add PackFromVector4(Vector4 vector) and ToNormalizedVector4()?
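
Something along these lines perhaps (just a sketch, the struct name is made up and this isn't the final pixel contract):

using System.Numerics;

// Hypothetical pixel struct for illustration only.
public struct Rgba32Pixel
{
    public byte R, G, B, A;

    // Assumes the input is already in the 0-255 range: clamp only, no scaling.
    public void PackFromVector4(Vector4 vector)
    {
        vector = Vector4.Clamp(vector, Vector4.Zero, new Vector4(255f));
        this.R = (byte)vector.X;
        this.G = (byte)vector.Y;
        this.B = (byte)vector.Z;
        this.A = (byte)vector.W;
    }

    // Cheapest conversion: raw 0-255 components, no scaling.
    public Vector4 ToVector4() => new Vector4(this.R, this.G, this.B, this.A);

    // Scaled 0-1 view for consumers that expect normalized components.
    public Vector4 ToNormalizedVector4() => this.ToVector4() / 255f;
}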

Resize performance

  • Use ToVector4() bulk conversion
  • Apply data-oriented microoptimizations
  • Reduce resampler window sizes (I suppose most of the values in horizontalValues are near zero)
    • This point was not part of my experiment; we can expect further speedup from it.

Agree on your first two points but not convinced by the last one. Close to zero won't cut it for many consumers, trust me, they'll spot any difference from other libraries. Resampling accuracy has to be absolutely spot on and I don't want to sacrifice it for a few milliseconds.

Your initial benchmarks without our faster Color struct were extremely promising anyway, 2x faster as I recall? That's pretty fast!

Jpeg

Agree (If we could speed up the decoder that would be absolutely brilliant. That's our biggest bottleneck at the mo). Had to reread the code for your last point re the Encoder but yeah. If anything just to clean it up a bit more. Speed is pretty good so far.

2. New, potentially controversial topics

Internal API

Convolution processors

  • Kernel matrices are really small; allocating them on the heap is shooting ourselves in the foot
  • We can invent a new "quasi-dynamic" value type matrix struct instead

We could.... Don't know if it's all that worth it though. We only initialise these arrays once. Plus the Gaussian kernel can be any size.

OctreeQuantizer

  • Benchmark octree PNG encoder vs. palette PNG encoder on a large image (I suppose Octree is really slow now)
  • Review algorithmic complexity
  • How does the tree look when built from large images?
  • Does insertion perform in O(1)?
    • Data-oriented optimization: Allocate OctreeNode-s as structs in a contiguous memory area

Don't compare Octree vs Palette. The output quality difference between the two is massive. Octree gives the best output quality/perf combo; see this article for a summary of different approaches. I've already reduced memory consumption dramatically.

Also don't compare our encoding speed with System.Drawing, as the output quality difference is also massive. Dunno what the approach in System.Drawing is but it's absolutely terrible, worse than our palette quantizer. I'd like to see our encoder get closer though, and I'd like to see the decoder speed up; at the mo it's over 2x slower. Same with Gif.

Complexity is unknown at the moment. I had to extend the original algorithm to allow for multiple alpha values to properly support indexed png's. I would be interested in experimenting with any memory optimizations though.

antonfirsov commented 7 years ago

@JimBobSquarePants

Re: Convolution processors:

We only initialise these arrays once.

It's not about the cost of initializing an array, it's about heap-allocated vs. stack-allocated data. With a stack-allocated kernel matrix you can be 100% sure the coefficients are always in the L1 cache, and you have a pretty high chance they actually fit into CPU registers. It's a common practice in imaging libraries to keep this data stack-allocated. The difference in perf should be huge.
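
Roughly what I have in mind, sketched for a fixed 3x3 case (hypothetical type, not a final API proposal; the interesting part is handling the varying kernel sizes):

// Hypothetical fixed-size value-type kernel; real kernels vary in size,
// which is the part that needs the "quasi-dynamic" trick.
public struct Kernel3x3
{
    public readonly float M11, M12, M13;
    public readonly float M21, M22, M23;
    public readonly float M31, M32, M33;

    public Kernel3x3(
        float m11, float m12, float m13,
        float m21, float m22, float m23,
        float m31, float m32, float m33)
    {
        this.M11 = m11; this.M12 = m12; this.M13 = m13;
        this.M21 = m21; this.M22 = m22; this.M23 = m23;
        this.M31 = m31; this.M32 = m32; this.M33 = m33;
    }

    // Convolves one pixel given its 3x3 neighbourhood; the coefficients live
    // on the stack (or in registers), never on the heap.
    public float Apply(
        float p11, float p12, float p13,
        float p21, float p22, float p23,
        float p31, float p32, float p33)
        => this.M11 * p11 + this.M12 * p12 + this.M13 * p13
         + this.M21 * p21 + this.M22 * p22 + this.M23 * p23
         + this.M31 * p31 + this.M32 * p32 + this.M33 * p33;
}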

Re: Resize resampler window

Close to zero won't cut it for many consumers, trust me, they'll spot any difference from other libraries.

I mean, about >=80% of the values in the window could be <0.0001. You can safely drop them without loss of quality; no human being can spot it :) Most libraries do this stuff with small kernel matrices. Using a window of destinationWidth size seems to be a large waste of CPU resources.
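
The idea, as a rough sketch over one destination pixel's precomputed weights (hypothetical helper, not ImageSharp code):

using System;

static class ResamplerWindow
{
    // Trim leading/trailing near-zero taps from one destination pixel's
    // precomputed weight window so the inner loop only touches taps that
    // actually contribute.
    public static (int start, float[] weights) Trim(float[] window, float epsilon = 0.0001f)
    {
        int first = 0;
        while (first < window.Length && Math.Abs(window[first]) < epsilon)
        {
            first++;
        }

        int last = window.Length - 1;
        while (last >= first && Math.Abs(window[last]) < epsilon)
        {
            last--;
        }

        // Only the significant taps are kept; 'start' tells the caller where
        // the trimmed window begins relative to the original one.
        float[] trimmed = new float[last - first + 1];
        Array.Copy(window, first, trimmed, 0, trimmed.Length);
        return (first, trimmed);
    }
}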

Re: Octree

Don't compare Octree vs Palette.

I don't want to. I just want to have a benchmark that proves that the current Octree implementation is horribly slow :) I mean, orders of magnitude slower than it should be! There are two major issues I noticed in the code:

I'd really like to see an up-for-grabs issue for each of these topics. It can help us keep track of our work, and if we are lucky, new contributors could pick them up.

antonfirsov commented 7 years ago

@JimBobSquarePants Here is a list of topics I'd like to open issues for:

Points that might be simpler to solve without opening issues (?), but where we still need to track our work:

UPDATE The most urgent topics are the ones that affect the public API. (At least if we don't want the library to be stuck with slow-by-design APIs ...)

antonfirsov commented 7 years ago

I'll go on with my monologue here, trying to give more general context to my proposals.

On modern architectures the most expensive CPU operation is accessing memory. [MemLatency chart]

This means code optimizations should focus on reducing cache misses. That can be achieved by making good choices about data structures and by reducing non-cache-friendly operations (like virtual calls) on hot paths.
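
One concrete example of the virtual-call point, sketched with made-up names: constraining a generic parameter to a struct that implements the interface lets the JIT devirtualize and usually inline the per-pixel call.

public interface IPixelOp
{
    float Apply(float value);
}

public struct Invert : IPixelOp
{
    public float Apply(float value) => 1f - value;
}

public static class HotLoop
{
    // Because TOp is constrained to a struct, op.Apply is a direct call the
    // JIT can inline, instead of an interface dispatch per pixel.
    public static void Run<TOp>(float[] pixels, TOp op)
        where TOp : struct, IPixelOp
    {
        for (int i = 0; i < pixels.Length; i++)
        {
            pixels[i] = op.Apply(pixels[i]);
        }
    }
}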

All in all:

I know this is a relatively new thing in the world of .NET library development (prepared by Joe Duffy's brilliant work, and driven by .NET Core's business goals today), but we are the guys who love to learn new things, aren't we? :)

I think if we want to develop this library with a performance-as-a-feature mindset, it's mandatory to change our coding style to follow DOD principles!

tocsoft commented 7 years ago

RE breaking apis for the convolution processors and all other processors

I think the first step, instead of trying to figure out and agree on the correct API, should be to make all our implementations of IImageProcessor<TColor> internal. Most users are just going to use the extension methods anyway, so if we make them all internal we can break them internally as we fix/update them to be more efficient, without breaking the public APIs.

On a wider note, I think it's probably worth doing a pass across the whole library and internalising everything we can, leaving only interfaces, extension methods, colors, the Image/Image<TColor> objects and anything else needed to make them compile. Basically, reduce our public API to the barest minimum surface area so we have the most flexibility in refactoring the core as required.

I think for our first public release we could even keep the PixelArea stuff internal too, as I can't imagine most of our developers will be doing much more than the simpler image manipulations (open file, crop, resave, etc.) and that's all going to be possible from the extension methods.

Then later we can make a decision on each API on a case-by-case basis as we find developers that need access to that lower level stuff.

JimBobSquarePants commented 7 years ago

@antonfirsov Apologies for the slow reply on this, there's a lot to digest.

Octree

Regarding the Octree quantizer having a complexity of O(n*n): there's already a maximum depth of 8 for each level. An Octree is initialized with that limit, so I don't understand how you would reduce that maximum value and continue to utilize that algorithm.

I do see though how we can improve the memory layout of nodes by turning them into structs, so I encourage that improvement.
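
Something like a contiguous, index-linked node pool could work here (a rough sketch with hypothetical types, not the current Octree code):

using System;

public struct OctreeNode
{
    public long RedSum, GreenSum, BlueSum;
    public int PixelCount;
    public int FirstChild;   // index of the first of 8 child slots, -1 if leaf
}

public class OctreeNodePool
{
    private OctreeNode[] nodes = new OctreeNode[256];
    private int count;

    // Returns the index of a freshly allocated node in the contiguous array.
    public int Allocate()
    {
        if (count == nodes.Length)
        {
            Array.Resize(ref nodes, nodes.Length * 2);
        }

        nodes[count] = new OctreeNode { FirstChild = -1 };
        return count++;
    }

    // ref access avoids copying the struct on every read/update.
    public ref OctreeNode this[int index] => ref nodes[index];
}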

I've knocked up a proper benchmark comparing all our Quantization algorithms (Palette, Octree, Wu) by encoding an indexed Png with quality set to 256. The original benchmark we had comparing to System.Drawing didn't ask the right questions as System.Drawing cannot save indexed Png's and therefore would never use any quantization.

| Method | LargeImage | Mean | StdErr | StdDev | Median | Scaled | Scaled-StdDev | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 'ImageSharp Octree Png' | False | 353.8687 ms | 3.1196 ms | 12.0823 ms | 352.1816 ms | 1.00 | 0.00 | 800.0000 | 583.3333 | 237.5000 | 9.16 MB |
| 'ImageSharp Octree NoDither Png' | False | 76.2996 ms | 0.7592 ms | 6.7054 ms | 78.5256 ms | 0.22 | 0.02 | 412.7907 | 225.2907 | - | 3.08 MB |
| 'ImageSharp Palette Png' | False | 434.3553 ms | 4.4461 ms | 37.4638 ms | 423.4140 ms | 1.23 | 0.11 | 425.5952 | 288.6905 | 232.1429 | 6.85 MB |
| 'ImageSharp Palette NoDither Png' | False | 86.8364 ms | 1.1569 ms | 11.5107 ms | 86.4912 ms | 0.25 | 0.03 | 417.5000 | 292.5000 | 230.0000 | 1.99 MB |
| 'ImageSharp Wu Png' | False | 144.5337 ms | 1.4385 ms | 13.8720 ms | 142.1108 ms | 0.41 | 0.04 | - | - | - | 872.71 kB |

As you can see it's the dithering algorithm that dramatically slows the implementation down. Dithering is slow though, as we touch multiple neighbouring pixels for each pixel we process.

Octree without dither is actually the fastest of our algorithms.

I have noticed an implementation issue though in the Octree quantizer: we're using euclidean distance to look up the palette when dithering instead of the Octree QuantizePixel method. That's wrong and gives us the wrong result, so it should be switched out. There should be a simple optimization we can make to check against the previous pixel to speed up the second pass and reduce lookups. That will be a quick win.
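
The previous-pixel shortcut would look roughly like this (a sketch; paletteLookup stands in for whatever lookup the quantizer actually uses):

using System;

static class SecondPass
{
    // Only pay for a palette lookup when the source pixel changes.
    public static void MapRow(uint[] row, byte[] indices, Func<uint, byte> paletteLookup)
    {
        if (row.Length == 0)
        {
            return;
        }

        uint previousPixel = row[0];
        byte previousIndex = paletteLookup(previousPixel);
        indices[0] = previousIndex;

        for (int x = 1; x < row.Length; x++)
        {
            uint pixel = row[x];
            if (pixel != previousPixel)
            {
                previousPixel = pixel;
                previousIndex = paletteLookup(pixel);
            }

            indices[x] = previousIndex;
        }
    }
}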

I'd like to get to grips with all that GC collection also.

Resizing

I'm still not convinced we can use a smaller weight map when resampling. Each item within the map is a required index, and doing otherwise seems to go against the information and code samples I have read.

http://entropymine.com/imageworsener/resample/

We're already scaling our maps correctly, only performing that calculation once.

I'd like to see the bulk operation resizing algorithm in action to see the speedup there before we even consider touching weighted maps. Our unpacked Color struct should also be a major priority.

Convolution

What kind of struct do you have in mind for that? We are already using the custom Fast2dArray<T> struct for those operations.

DOD etc

Yes, but let's not be too hard on ourselves. The library isn't slow.

API changes

Well, it's a good thing we've had almost a 2-year alpha then, isn't it 😄 I'm all for making everything but our desired endpoints internal, except for the IImageProcessor implementation classes. They should be protected as they can be used for combining imaging effects; see the EntropyCropProcessor for example.

antonfirsov commented 7 years ago

TLDR

I'm asking for:

On internalization

~I'm happy we agreed on internalization!~ This way these points will not block the beta. Customization and extensibility are not necessarily 1.0 features anyway :)

UPDATE After a second careful read, I noticed that you don't want to internalize processor implementations. For processors where the perf issues are caused by design choices, you then have 3 options:

None of them sounds good to me.

Regarding topics I started to discuss:

I'm still quite sure (let's say 80% sure) that my assumptions are true for all of them. At this point I base my critiques on investigation of the code and my knowledge of the specific algorithms and cache-friendly development practices (mostly adopted from C++). The problem is that even a proper benchmark-based proof, or other proper investigation, could take a large amount of time for each topic! As you can see, even discussing them in a single issue is quite hard. That's why I suggested opening individual issues for all of them.

If we open them, I will continuously add the results of my investigations to the issues, proving that we can improve a lot on each topic! If the investigation shows that I was wrong, I will be happy to close the issue. (But it won't! :)

Also, if a new developer wants to join in and help with perf, this could be a good guide on where to start.

antonfirsov commented 7 years ago

RE Octree

There's already a maximum depth of 8 for each level.

I mean the total depth of the tree, not individual levels!

What was the image size on your benchmarks? I noticed issues with large images (2000*2000 min). A proper benchmark would have the image size in MegaPixels as [Params] and check the growth of computation time against the growth of the input size. I'm quite sure it won't be linear.
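
Something along these lines, with a synthetic workload standing in for the actual quantizer (hypothetical names):

using System;
using BenchmarkDotNet.Attributes;

public class QuantizerScalingBenchmark
{
    [Params(1, 4, 16)]              // input size in megapixels
    public int MegaPixels { get; set; }

    private uint[] pixels;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        pixels = new uint[MegaPixels * 1000000];
        for (int i = 0; i < pixels.Length; i++)
        {
            pixels[i] = (uint)rng.Next();
        }
    }

    [Benchmark]
    public int Quantize()
    {
        // Stand-in workload; replace with the quantizer under test so the
        // results table shows how runtime grows with MegaPixels.
        int acc = 0;
        foreach (uint p in pixels)
        {
            acc ^= (int)p;
        }

        return acc;
    }
}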

UPDATE:

RE Convolution kernels

Fast2DArray<T> is just a wrapper over a heap object. I mean a real value type like Matrix4x4. I know this is not trivial because of the different kernel sizes, but I have an idea to solve this! You can expect a 2x speedup or more! But I need time to elaborate on it; that's why I'm suggesting internalization.

antonfirsov commented 7 years ago

The library isn't slow.

Depends on the benchmarks you use. It's not hard to construct benchmarks (for real-life scenarios!) that prove the opposite :P Okay, maybe it's not slow, but there is a lot of room for improvement! And I know the specific points where we can improve! :)

tocsoft commented 7 years ago

@JimBobSquarePants internalising the processors would not stop EntropyCropProcessor working. Yes, it would make it more difficult for a third party to create one, but that's actually a separate issue and I believe shouldn't be a goal of the first release. We want to get something out that will help most users, and most aren't going to want to create custom processors.

We should worry about extensibility as a v2/v1.1 goal, not v1. Let's just get v1 out with our locked-down processors and worry about unlocking them when, and only when, a third party wants them made public and not before. We might never have a developer who wants to make a processor like that, and thus we might never need to make them public.

JimBobSquarePants commented 7 years ago

OK, ok, you've got me.....

Let's

A. Internalize absolutely everything we don't need and only publicize objects on demand.
B. Get issues opened for various discussion points.

I can take on A if you like. Easy win.

P.S I'm really enjoying these discussions. I think it brings out the best in all of us 🥇

blackcity commented 7 years ago

Good to see this great discussion. We compare this library against our existing solutions using PresentationCore on the traditional framework. In some use cases we bulk-process multiple files of 5MB and more. In this scenario ImageSharp tends to be 2-3 times slower than PresentationCore. For instance, processing 5 images of 10MB each: PresentationCore (10 sec), ImageSharp (30 sec). I don't know if the library can get faster with pure .NET code optimizations or if it's only possible with native code as @antonfirsov mentioned. But once the library is published it will be benchmarked against other solutions and libraries, so it's better to keep an eye on this early.

So, I totally agree with @antonfirsov.

JimBobSquarePants commented 7 years ago

I think we can make some pretty major improvements with some of the changes. We might never beat PresentationCore (That's really fast) but I think we can get a lot closer.

I'm hoping that we can attract some of the major performance players that got stuck into Asp.Net Core. I think many are put off though, as they expect to be required to understand the image processing algorithms.

antonfirsov commented 7 years ago

@blackcity I'm not saying I want to optimize with native code. I'm saying I want to adopt DOD techniques, which are usually uncommon in pure C# code but quite common in C/C++ projects.

Just being curious: what is your use case? (Input format, typical image sizes, applied processors, output format)

JimBobSquarePants commented 7 years ago

See #137 for part A. I'm off to bed now.

blackcity commented 7 years ago

@antonfirsov Sorry, DOD of course. In general, the images we process can be of any size, from 1MB to 50MB. The test is not very sophisticated, just System.Diagnostics.Stopwatch. We processed (resize/crop/canvas) 5 JPG files of 10MB each on a quad-core processor and also output to JPG.

A while ago we ported some of our applications to .NET Core. In traditional .NET we use PresentationCore for image processing. After some research and tests we think that ImageSharp has the capability to become the standard library for .NET Core regarding image processing.

@JimBobSquarePants Most important for us is cross-platform support. Then come performance and quality. Quality is quite good, and we don't really expect the library to beat PresentationCore. By the way, PresentationCore has its own oddities and, most importantly, I can't run it on Linux or Mac ;-)

antonfirsov commented 7 years ago

@JimBobSquarePants happy to see #137 open! I will take care of B, continuously adding my points.

vpenades commented 7 years ago

Hi, sorry to throw this here if it's not the right place.

I was looking at the Color structure, which is essentially a wrapper over "uint packedValue;", and, in order to access the RGBA components, it uses the classic bit-shift and mask operations, which are probably quite fast.

In the past, instead of using bit-shift/mask operations, I got used to using this trick:

[System.Runtime.InteropServices.StructLayout(System.Runtime.InteropServices.LayoutKind.Explicit, Size = 4)]
struct Color
{
    [System.Runtime.InteropServices.FieldOffset(0)]
    public UInt32 PackedValue;

    [System.Runtime.InteropServices.FieldOffset(0)]
    public Byte Blue;

    [System.Runtime.InteropServices.FieldOffset(1)]
    public Byte Green;

    [System.Runtime.InteropServices.FieldOffset(2)]
    public Byte Red;

    [System.Runtime.InteropServices.FieldOffset(3)]
    public Byte Alpha;
}

It might look weird and ugly, and it is also sensitive to endianness, but it completely removes the bit-shift/mask operations, which might improve performance.

antonfirsov commented 7 years ago

which might improve performance.

Might or might not :) [System.Runtime.InteropServices.FieldOffset(1)] is an unaligned construct, therefore it might be implemented with more, or more expensive, CPU instructions than the "more aligned" [System.Runtime.InteropServices.FieldOffset(4)] case.

The best way to investigate the FieldOffset vs. bit-shift question is to create two benchmarks comparing these implementations.
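
A rough BenchmarkDotNet sketch of such a comparison (made-up names; like the struct above it assumes little-endian RGBA packing with red in the third byte):

using System;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

[StructLayout(LayoutKind.Explicit, Size = 4)]
public struct OverlaidColor
{
    [FieldOffset(0)] public uint PackedValue;
    [FieldOffset(0)] public byte Blue;
    [FieldOffset(1)] public byte Green;
    [FieldOffset(2)] public byte Red;
    [FieldOffset(3)] public byte Alpha;
}

public class ComponentAccessBenchmark
{
    private uint[] pixels;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        pixels = new uint[4096];
        for (int i = 0; i < pixels.Length; i++)
        {
            pixels[i] = (uint)rng.Next();
        }
    }

    [Benchmark(Baseline = true)]
    public int ShiftAndMask()
    {
        int sum = 0;
        foreach (uint packed in pixels)
        {
            sum += (int)((packed >> 16) & 0xFF); // red channel via shift + mask
        }

        return sum;
    }

    [Benchmark]
    public int FieldOffsetAccess()
    {
        int sum = 0;
        foreach (uint packed in pixels)
        {
            var color = new OverlaidColor { PackedValue = packed };
            sum += color.Red; // red channel via the overlapping byte field
        }

        return sum;
    }
}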

jackmott commented 7 years ago

Regarding Parallel.For usage, I apologize if everyone here is already familiar with it, but I stumbled upon this today: when using Parallel.For with arrays or List you can get drastically better performance by using Parallel.ForEach along with a Partitioner. It is typically ~2x to ~4x faster depending on the workload, data width, and core count of the machine. This happens because you guarantee big chunks of data that are accessed in order, taking advantage of CPU caches.

The default Partitioner.Create does batches of size length/(logicalCoreCount*3) but you can pick your own with Partitioner.Create(from, to, batchSize) to tune it for the workload.

using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

int[] a = new int[1000000]; // the data being summed
int sum = 0;

// Hand each task a contiguous index range so memory is walked in order.
var rangePartitioner = Partitioner.Create(0, a.Length);

Parallel.ForEach(
    rangePartitioner,
    () => 0,                               // per-task local accumulator
    (range, state, acc) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
        {
            acc += a[i];
        }

        return acc;
    },
    acc => Interlocked.Add(ref sum, acc)); // merge the local sums

JimBobSquarePants commented 7 years ago

Thanks @jackmott I didn't actually know about the improvements in that regard.

Definitely worth taking the time to experiment with the processors to see what we come up with.

antonfirsov commented 7 years ago

@jackmott @JimBobSquarePants I was also thinking about fine-tuning the workload sizes, and came to the conclusion that we need a nested range-based solution to reduce cache misses and false sharing, with a for (i = range.Min; i < range.Max; i++) style loop inside the parallel lambda.

I wasn't aware that Partitioner could simplify our job here, so thanks for the hint! 👍

JimBobSquarePants commented 7 years ago

@jackmott @antonfirsov

Is there a known theory we can work from for determining these partition ranges or is it a case of identifying ranges based on struct sizes etc and experimenting via benchmarking?

jackmott commented 7 years ago

No, it depends on the workload. The default Create will spawn numLogicalCores*3 tasks, which is more than you want if you are just iterating over some arrays of value types and doing some maths, but might be less than you want if you are iterating over a file and loading bits at a time from disk or something. You have to experiment.

Also, if you are just iterating over arrays and doing some maths, and System.Numerics has the math operations you are doing covered (or nearly so, you can drop out to scalar occasionally), that will tend to work out much better than parallelizing things. No overhead managing the tasks, and it doesn't care how many CPUs the machine has. Any 64-bit build is guaranteed to be running on a machine with at least SSE.
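
For example, summing an int array with Vector<T> looks roughly like this (just a sketch; lane widths are hardware dependent):

using System.Numerics;

static class SimdSum
{
    public static int Sum(int[] data)
    {
        var accumulator = Vector<int>.Zero;
        int i = 0;

        // Full SIMD lanes first.
        for (; i <= data.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            accumulator += new Vector<int>(data, i);
        }

        // Horizontal add of the lanes.
        int sum = Vector.Dot(accumulator, Vector<int>.One);

        // Scalar tail for the leftover elements.
        for (; i < data.Length; i++)
        {
            sum += data[i];
        }

        return sum;
    }
}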

I'm not really British, I just like saying maths.

BrianJThomson commented 7 years ago

@blackcity I see you have done some profiling with WPF. We have a discussion in thread #151 about memory issues. Can you provide some of your data? It might help.

blackcity commented 7 years ago

Mmmh, yeah, but I don't know if this has any benefit. I have to deal with very large files, more than 1GB in general and up to 50MB when it comes to image files. So, we need to do automated profiling.

JimBobSquarePants commented 6 years ago

I don't know. You haven't supplied a version number, dimensions, colorspace info, whether it's baseline or progressive, nor the image itself.

antonfirsov commented 6 years ago

@TechnikEmpire please, post your image here, and I will investigate it! Our jpeg decoder is strongly optimized with SIMD for a certain super-common category of images, but not yet for others. There is a lot of work here, and we are a very small team.

Reacting to your deleted post: yeah, @JimBobSquarePants was a bit rude here, but let's not start flaming each other and get back to a constructive path! I'm sure everyone here wants to make this library better. @JimBobSquarePants spent crazy amounts of time starting the library and shaping a good API. When I joined the team, I did the same to make it faster. I strongly believe that in the long term it's possible to achieve close-to-native performance. It's just a matter of invested work.

TechnikEmpire commented 6 years ago

@antonfirsov I deleted my post because, even though I still think it was condescending, 1) I tend to get mouthy too fast, for which I apologize, and 2) it appears that an in-progress hardware failure I was unaware of until now is to blame. It was the IO process itself that was taking roughly 350 msec, not your library. I have a data drive where I store masses of image data for various classification experiments I run, and apparently the constant abuse has taken a toll. So, I eat my words and apologize for jabbing at performance.

antonfirsov commented 6 years ago

@TechnikEmpire no worries, I'm glad you figured it out!

Also good to know, in case you have come across the posts of a guy on StackOverflow evangelizing that the library is slow: he was "benchmarking" it in Debug mode.

TechnikEmpire commented 6 years ago

@antonfirsov of course, stack overflow is a hole. Perf isn't bad; I'm scaling down images to 220 square before running my algo (ported from OpenCV) and I'm only losing a few msec that way. Again, I eat my words, and honestly I should have figured that such atrocious load times were due to something else.

JimBobSquarePants commented 6 years ago

@TechnikEmpire Sorry if I came across as condescending, it's been a rough couple of weeks and I'm not great at online conversation at times. I'm glad you found the solution and I hope the hardware will not be too expensive to replace.

Your assistance at any time would be, of course, most welcome.

TechnikEmpire commented 6 years ago

@JimBobSquarePants No worries, apologies for being a snarky jerk. I'm on a mission to create a binary image classification library, but it's a low priority in the mountain of work I have to do. If I can make time and find something I can contribute, I definitely will. Thanks for publishing your hard work.

antonfirsov commented 4 years ago

Closing this, since the topics are too generic now. Most problems have been addressed by now, and we need specific issues for the rest. Octree perf is still worth investigating.