avaneev / avir

High-quality pro HDR image resizing / scaling C++ library, including a very fast, precise, SIMD Lanczos resizer (header-only C++)
MIT License

Multithreading HowTo #8

Open masc4ii opened 5 years ago

masc4ii commented 5 years ago

Thank you for sharing this great resizing library! Results look so good! Now I would like to get it a little faster using multithreading, but I don't know how. In your documentation I found class CImageResizerThreadPool... is this the right class for multithreading? Unfortunately I have no idea how to use it. Could you please share a small example, how to realize multithreaded resizing with your library? Thanks in advance for your help!

avaneev commented 5 years ago

It is an extremely simple front-end. You should first learn how "worker thread pools" work in general, then you will be able to implement it.

My own implementation looks like this, but it needs a custom-programmed thread pool and worker thread objects that call Workload -> process().

class CThreadPool : public avir :: CImageResizerThreadPool,
    public CWorkerThreadPool
{
public:
    int MaxThreadCount; // The number of threads to use.

    virtual int getSuggestedWorkloadCount() const
    {
        return( MaxThreadCount <= 0 ? CSystem :: getProcessorCount() :
            MaxThreadCount );
    }

    virtual void addWorkload( CWorkload* const Workload )
    {
        VOXERRSKIP( add( new CResizeThread( Workload )));
    }

    virtual void startAllWorkloads()
    {
        startAll();
    }

    virtual void waitAllWorkloadsToFinish()
    {
        VOXERRSKIP( waitAllForFinish() );
    }

    virtual void removeAllWorkloads()
    {
        removeAllThreads();
    }
};

masc4ii commented 5 years ago

Thanks for your answer. In the past I implemented a thread pool for another project, so I will have a look whether that helps here too. What I do not understand yet: does this strategy a) help to render a single picture faster, or b) help to render e.g. 4 pictures in nearly the same time as one (on a quad-core CPU)? If a): I don't see how to divide the picture into parts. Wouldn't this be necessary somehow?

avaneev commented 5 years ago

On a 4-core processor the resizing speed increases by a factor of 3.2, so it does help to resize a single image faster. The algorithm divides the image automatically.
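For readers without a custom thread library, the five-method interface from the first reply can be sketched with std::thread alone. This is only a minimal sketch under stated assumptions: CPoolInterface is a stand-in that mirrors the method names shown in this thread so the code compiles without avir.h, and CStdThreadPool, CountingWorkload and countProcessCalls are hypothetical names introduced for illustration. It starts one thread per workload on every pass, which is simple but not necessarily efficient, and it assumes the pool does not own the workloads it is given.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Stand-in for avir::CImageResizerThreadPool so that this sketch compiles
// without avir.h. Only the method names used in this thread are mirrored.
class CPoolInterface
{
public:
    class CWorkload
    {
    public:
        virtual ~CWorkload() {}
        virtual void process() = 0;
    };

    virtual ~CPoolInterface() {}
    virtual int getSuggestedWorkloadCount() const = 0;
    virtual void addWorkload( CWorkload* const Workload ) = 0;
    virtual void startAllWorkloads() = 0;
    virtual void waitAllWorkloadsToFinish() = 0;
    virtual void removeAllWorkloads() = 0;
};

// Hypothetical pool: one std::thread per workload per pass.
// Workloads are stored as plain pointers and are never deleted here.
class CStdThreadPool : public CPoolInterface
{
public:
    ~CStdThreadPool() override { joinAll(); }

    int getSuggestedWorkloadCount() const override
    {
        // hardware_concurrency() may return 0 when the count is unknown.
        const unsigned int n = std::thread::hardware_concurrency();
        return( n == 0 ? 1 : (int) n );
    }

    void addWorkload( CWorkload* const Workload ) override
    {
        Workloads.push_back( Workload );
    }

    void startAllWorkloads() override
    {
        // Workloads stay in the list so that a later pass can run them again.
        for( CWorkload* const w : Workloads )
        {
            Threads.emplace_back( [ w ]() { w -> process(); } );
        }
    }

    void waitAllWorkloadsToFinish() override { joinAll(); }

    void removeAllWorkloads() override
    {
        joinAll();
        Workloads.clear();
    }

private:
    std::vector< CWorkload* > Workloads;
    std::vector< std::thread > Threads;

    void joinAll()
    {
        for( std::thread& t : Threads ) t.join();
        Threads.clear();
    }
};

// Hypothetical workload that only counts how often process() is called.
class CountingWorkload : public CPoolInterface :: CWorkload
{
public:
    explicit CountingWorkload( std::atomic< int >& aCounter )
        : Counter( aCounter ) {}
    void process() override { Counter.fetch_add( 1 ); }

private:
    std::atomic< int >& Counter;
};

// Runs PassCount start/wait cycles and returns the number of process() calls.
int countProcessCalls( const int WorkloadCount, const int PassCount )
{
    std::atomic< int > Counter( 0 );
    std::vector< CountingWorkload > Work( WorkloadCount,
        CountingWorkload( Counter ));
    CStdThreadPool Pool;
    for( CountingWorkload& w : Work ) Pool.addWorkload( &w );
    for( int Pass = 0; Pass < PassCount; Pass++ )
    {
        Pool.startAllWorkloads();
        Pool.waitAllWorkloadsToFinish();
    }
    Pool.removeAllWorkloads();
    return( Counter.load() );
}
```

With the real library, the class would derive from avir::CImageResizerThreadPool instead of the stand-in, and a pointer to it would be assigned to the ThreadPool field of avir::CImageResizerVars, as shown further down in this thread. On some toolchains the program may need to be linked with the platform thread library (for example -pthread).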

Ptomaine commented 5 years ago

Hello Aleksey,

Unfortunately, I've been unable to re-scale an image with my thread pool.

The result is a striped picture: [image]

But it's good when re-scaled with a single thread: [image]

The code of the re-scaling thread pool is the following:

using thread_pool_base = thread_pool;
class avir_scale_thread_pool : public avir::CImageResizerThreadPool, public thread_pool_base
{
public:
    virtual int getSuggestedWorkloadCount() const override
    {
        return thread_pool_base::size();
    }

    virtual void addWorkload(CWorkload *const workload) override
    {
        _workloads.push(workload);
    }

    virtual void startAllWorkloads() override
    {
        while (!std::empty(_workloads))
        {
            _tasks.emplace_back(thread_pool_base::enqueue([](auto workload){ workload->process(); }, _workloads.front()));
            _workloads.pop();
        }
    }

    virtual void waitAllWorkloadsToFinish() override
    {
        for (auto &task : _tasks) task.wait();
    }

private:
    std::deque<task_future<void>> _tasks;
    std::queue<CWorkload*> _workloads;
};

Could you please help me investigate why the result differs from what is expected? Thanks in advance!

avaneev commented 5 years ago

How do you initialize the thread pool?

avaneev commented 5 years ago

Which thread library are you using?

avaneev commented 5 years ago

It looks like not all threads are actually being executed, maybe some workload queue mistake.

avaneev commented 5 years ago

You also probably need to remove items from _tasks if they are not autoremoved.

Ptomaine commented 5 years ago

Here is the library that I use (attached to the message). thread_pool.txt

Just rename it to *.hpp

Ptomaine commented 5 years ago

The thread pool is utilized like this:

    nstd::avir_scale_thread_pool scaling_pool;
    nstd::avir::CImageResizerVars vars; vars.ThreadPool = &scaling_pool;
    nstd::avir::CImageResizerParamsUltra roptions;
    nstd::avir::CImageResizer<fpclass_dith> image_resizer { 8, 0, roptions};
    image_resizer.resizeImage(image, width, height, 0, new_image.get(), new_width, new_height, channels, 0, &vars);

Ptomaine commented 5 years ago

Removing tasks didn't help:

    virtual void removeAllWorkloads()
    {
        _tasks.clear();
    }

avaneev commented 5 years ago

Make sure thread_pool_base::size() returns the correct value: it should be the number of processors in the system. I doubt that the thread pool actually runs all workloads, so make sure it is functioning correctly. Test it by replacing workload->process(); with something like printf( "thread started\n" );. It should print this string thread_pool_base::size()*2 times.

Ptomaine commented 5 years ago

It does. It returns the right number of cores.

Ptomaine commented 5 years ago

the size of pool: 16 Workload... Workload... Workload... Workload... Workload... Workload... Workload... Workload... Workload... Workload... Workload... Workload... Workload... Workload... Workload...

Ptomaine commented 5 years ago

It prints 15 times...

Ptomaine commented 5 years ago

I changed the default value from:

std::max(std::thread::hardware_concurrency(), 2u) - 1u

to

std::max(std::thread::hardware_concurrency(), 2u)

It didn't change anything. The picture is still striped.

Ptomaine commented 5 years ago

Okay. I've fixed it! I just looked into your code and saw that you use the same workloads two times. My mistake was that I removed workloads right after the first execution. The proper thread pool looks like this:

using thread_pool_base = thread_pool;
class avir_scale_thread_pool : public avir::CImageResizerThreadPool, public thread_pool_base
{
public:
    virtual int getSuggestedWorkloadCount() const override
    {
        return thread_pool_base::size();
    }

    virtual void addWorkload(CWorkload *const workload) override
    {
        _workloads.emplace_back(workload);
    }

    virtual void startAllWorkloads() override
    {
        for (auto &workload : _workloads) _tasks.emplace_back(thread_pool_base::enqueue([](auto workload){ workload->process(); }, workload));
    }

    virtual void waitAllWorkloadsToFinish() override
    {
        for (auto &task : _tasks) task.wait();
    }

    virtual void removeAllWorkloads()
    {
        _tasks.clear();
        _workloads.clear();
    }

private:
    std::deque<std::future<void>> _tasks;
    std::deque<CWorkload*> _workloads;
};

avaneev commented 5 years ago

I'm glad you've got it working. I'll leave this issue open for others to learn from.

Ptomaine commented 5 years ago

Thank you!

masc4ii commented 5 years ago

Thanks to all! Got it working now too!
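The cause found above (the same workloads are started more than once, so the pool must keep them until removeAllWorkloads() is called) can be reproduced in isolation. The following is a simplified sketch, not AVIR code: Workload, InlinePool and totalRunsAfterTwoPasses are hypothetical names, the pool runs workloads inline instead of on threads to keep the focus on workload lifetime, and the two-pass driver is an assumption based on the comments in this thread.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a resizer workload: it only counts its runs.
struct Workload
{
    int Runs = 0;
    void process() { Runs++; }
};

// Simplified pool that runs workloads inline (no threads), so that only
// the lifetime of the workload list differs between the two variants.
class InlinePool
{
public:
    explicit InlinePool( const bool aDrainOnStart )
        : DrainOnStart( aDrainOnStart ) {}

    void addWorkload( Workload* const w ) { Pending.push_back( w ); }

    void startAllWorkloads()
    {
        for( Workload* const w : Pending ) w -> process();

        // The mistake from the first attempt: forgetting the workloads
        // as soon as they have been started once.
        if( DrainOnStart ) Pending.clear();
    }

    void waitAllWorkloadsToFinish() {} // Nothing to wait for: inline runs.

    void removeAllWorkloads() { Pending.clear(); }

private:
    bool DrainOnStart;
    std::vector< Workload* > Pending;
};

// Mimics a caller that adds workloads once and then starts them twice.
int totalRunsAfterTwoPasses( const bool DrainOnStart,
    const int WorkloadCount )
{
    std::vector< Workload > Work( WorkloadCount );
    InlinePool Pool( DrainOnStart );
    for( Workload& w : Work ) Pool.addWorkload( &w );

    for( int Pass = 0; Pass < 2; Pass++ )
    {
        Pool.startAllWorkloads();
        Pool.waitAllWorkloadsToFinish();
    }

    Pool.removeAllWorkloads();

    int Total = 0;
    for( const Workload& w : Work ) Total += w.Runs;
    return( Total );
}
```

In this sketch, the draining variant runs each workload only once, so the second pass silently does nothing, while the variant that keeps the workloads runs each of them in both passes.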

avaneev commented 3 years ago

@masc4ii Hello! I've registered myself on the magiclantern.fm forum, and posted a couple of messages there, still waiting for moderation approval. Anyway, to get the information faster to you, here's what I've posted there:

"Hi! Have the MLV App authors tried to apply non-linear "saturation" image transformations in a higher resolution, with a later downsizing step? This is not a common technique, but from the DSP standpoint it should look much better. "Aliasing" is not the whole story like in image resizing, there's also "harmonic distortion", which is not as apparent with images as it is with audio. Maybe worth a try.

A follow-up: the same actually applies to "linearization" or sRGB->linear conversion. It's a non-conventional approach and is resource-heavy, but probably it will fix the feel of all these gamma corrections being "not right"."

I would like to add that making e.g. 3x upsampling followed by 3x downsampling is completely safe with AVIR regarding the dynamic range. What is affected is frequency response, but this is "visible" when resizing smaller images mostly. Any high-resolution photo in most cases is already lacking in the highest frequencies due to limitations of the lenses mainly.

avaneev commented 3 years ago

@masc4ii It's even a lot safer with a 2x upsample-downsample cycle (retains an unbelievable 120 dB range), but for best results I think 3x is needed.

avaneev commented 3 years ago

@masc4ii To optimize things in the pipeline it may be useful to upsample first, then run the whole pipeline, then downsample. I do this in professional audio software; it's a very important feature for the users.

masc4ii commented 3 years ago

Hi @avaneev ,

thank you! We'll see when the mods enable your account... 😄 ...yes, here I can read already.

Until now all operations are done at the original resolution of the RAW image. For the final export after processing with ffmpeg we upscale with AVIR. I can imagine that doing all those processing operations on an upscaled image could give better quality, but that would probably slow down the processing a lot, or am I wrong? For most users, one of the biggest downsides of our app is currently the processing time. In your example (3x upscaling) I would expect the application to be ~9x slower. Is that correct, or did I misunderstand? What exactly do you mean by sRGB->linear conversion? As far as I know, we do the opposite: RAW sensor data is linear and we convert to "something good-looking or useful" like sRGB or those Log profiles. @ilia3101 : most of the processing was your work - what do you say?! 👀

Do you own a ML capable camera? If you want, I could send you some MLV samples, so you could play with the app a little - if you like.

Thanks so much for your ideas!

avaneev commented 3 years ago

Yes, that will slow down all processing, maybe not by a factor of 9, but easily by a factor of 7. The linear->sRGB conversion is the same situation: it's a non-linear sample mapping, so my proposal applies there, too.

I do not have an ML capable camera now, but I did a lot of "photography" in the past when I had a mid-range Canon EOS camera and a couple of lenses. Here you can see my "photoworks", on my audio samples product pages: https://www.voxengo.com/group/drum-samples/

avaneev commented 3 years ago

@masc4ii This option can be made switchable, e.g. 1x, 2x or 3x oversampling. For short videos or nightly renders one could select 2x or 3x, 1x for other cases. This is a "transparent" option, it does not change things too much, but probably improves perceived dynamic range and "smoothness" of the footage.

avaneev commented 3 years ago

@masc4ii One more note: my subjective feeling says that an image becomes apparently "more vivid" when non-linear sample mapping is applied at an increased resolution, and indeed it looks "smoother", but not in a "blurry" meaning. Dynamic range improvement is not too perceptible for me.

avaneev commented 3 years ago

@masc4ii It looks more "cinematic" I would also say, closer to the "vintage" than "eye-popping" crispness of "modern".

avaneev commented 3 years ago

@masc4ii Of course, it's important to apply any saturation/gamma transformation before the final resize; applying it afterwards won't make things look much better.
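The order of operations proposed above (upsample, apply the non-linear mapping, then downsample) can be sketched on a single row of samples. This is a rough illustration only: linear interpolation and box averaging are swapped in for a real resizer such as AVIR and have a much poorer frequency response, and upsampleLinear, downsampleBox and mapOversampled are hypothetical helper names. The linear->sRGB function is the standard piecewise sRGB transfer curve. Note also why the cost grows quickly in 2D: a 3x factor per axis means 3 × 3 = 9 times as many pixels pass through the mapping.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Standard linear -> sRGB transfer function for a value in [0, 1].
double linearToSrgb( const double c )
{
    return( c <= 0.0031308 ? 12.92 * c :
        1.055 * std::pow( c, 1.0 / 2.4 ) - 0.055 );
}

// Upsamples a row by an integer factor using linear interpolation
// (a crude stand-in for a high-quality resizer).
std::vector< double > upsampleLinear( const std::vector< double >& Src,
    const std::size_t Factor )
{
    std::vector< double > Dst( Src.size() * Factor );

    for( std::size_t i = 0; i < Dst.size(); i++ )
    {
        const double Pos = (double) i / (double) Factor;
        const std::size_t i0 = (std::size_t) Pos;
        const std::size_t i1 = ( i0 + 1 < Src.size() ? i0 + 1 : i0 );
        const double t = Pos - (double) i0;
        Dst[ i ] = Src[ i0 ] * ( 1.0 - t ) + Src[ i1 ] * t;
    }

    return( Dst );
}

// Downsamples a row by an integer factor by averaging each group of
// Factor samples (again a crude stand-in for a real resizer).
std::vector< double > downsampleBox( const std::vector< double >& Src,
    const std::size_t Factor )
{
    std::vector< double > Dst( Src.size() / Factor );

    for( std::size_t i = 0; i < Dst.size(); i++ )
    {
        double Sum = 0.0;

        for( std::size_t k = 0; k < Factor; k++ )
        {
            Sum += Src[ i * Factor + k ];
        }

        Dst[ i ] = Sum / (double) Factor;
    }

    return( Dst );
}

// Applies the non-linear mapping at Factor times the original resolution
// and returns a row of the original length.
std::vector< double > mapOversampled( const std::vector< double >& Row,
    const std::size_t Factor )
{
    std::vector< double > Up = upsampleLinear( Row, Factor );
    for( double& v : Up ) v = linearToSrgb( v );
    return( downsampleBox( Up, Factor ));
}
```

A Factor of 1 reduces to the plain per-sample mapping, which makes the oversampling easy to expose as a switchable 1x/2x/3x option.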