ermig1979 / Simd

C++ image processing and machine learning library with using of SIMD: SSE, AVX, AVX-512, AMX for x86/x64, VMX(Altivec) and VSX(Power7) for PowerPC, NEON for ARM.
http://ermig1979.github.io/Simd
MIT License

New resize method between Bicubic and Area #206

Closed mikeversteeg closed 1 year ago

mikeversteeg commented 2 years ago

For my purpose, resizing video, the first 3 resize methods (Nearest, Bilinear, Bicubic) are all fast and offer much the same (poor) quality. Area offers excellent quality but at much higher (in my case doubled!) CPU usage. Is it possible to create a method positioned just below Area? Instead of sampling all pixels of the scaled-down area, perhaps sample a raster? E.g. to scale down by a factor of 4, sample not all 16 pixels but only 4? I'm no expert; perhaps there are other methods that sit just below Area in quality and speed.

Thanks.
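
The raster idea above can be illustrated with a small sketch. This is not code from Simd: `DownscaleBox` and its `step` parameter are hypothetical names, and the sketch assumes a single 8-bit plane whose dimensions are exact multiples of the scale factor. With `step == 1` it averages all 16 pixels of each 4x4 block; with `step == 2` it averages only 4 of them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Downscale one 8-bit plane by an integer factor (illustrative only).
// step == 1: average every pixel of each factor x factor block (plain area filter).
// step == 2: average only every second pixel in both directions (a sparse raster).
std::vector<uint8_t> DownscaleBox(const std::vector<uint8_t>& src, size_t width, size_t height,
                                  size_t factor, size_t step)
{
    assert(factor > 0 && step > 0 && step <= factor);
    assert(width % factor == 0 && height % factor == 0);
    assert(src.size() == width * height);
    const size_t dstW = width / factor, dstH = height / factor;
    std::vector<uint8_t> dst(dstW * dstH);
    for (size_t dy = 0; dy < dstH; ++dy)
    {
        for (size_t dx = 0; dx < dstW; ++dx)
        {
            uint32_t sum = 0, count = 0;
            for (size_t y = 0; y < factor; y += step)
            {
                for (size_t x = 0; x < factor; x += step)
                {
                    sum += src[(dy * factor + y) * width + dx * factor + x];
                    ++count;
                }
            }
            dst[dy * dstW + dx] = uint8_t((sum + count / 2) / count); // rounded mean
        }
    }
    return dst;
}
```

Calling `DownscaleBox(plane, 1920, 1080, 4, 2)` would read a quarter of the source pixels that `step == 1` reads. Skipping pixels can alias on fine detail, so the visual result would need checking on real video.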

ermig1979 commented 2 years ago

Hello! Could you give me the typical sizes of the source and output images? It will help me make the right choice. To improve the performance of the Area method we could, for example, take only the even pixels, or take their sum with the same coefficients.

mikeversteeg commented 2 years ago

Good question. Unfortunately a typical setup scales down a few dozen video signals in all kinds of resolutions. This could be down to 25% of the original (within the same colour model, i.e. YUV420P or YUV422P). For video preview I often scale down even further, to around 8% (YUV420P/YUV422P to YUV444P to RGB). The original is typically full HD (1920*1080). If you like I can run some real-life benchmarks on different hardware.

ermig1979 commented 2 years ago

I have an idea: I can combine the ReduceColor2x2 and ResizeArea algorithms. This could give a performance gain of about 1.5-2.0 times.
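
One possible reading of this idea, sketched under assumptions: first halve the image with a 2x2 average, then run the area average on the result, so the more expensive stage sees only a quarter of the pixels. The names `Plane`, `Reduce2x2`, `BoxAverage` and `AreaReduced2x2` below are hypothetical and are not taken from the Simd sources; the sketch handles one 8-bit plane and even integer factors only, and an optimized version would presumably fuse the two stages rather than allocate an intermediate plane.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Plane
{
    size_t width, height;
    std::vector<uint8_t> data; // row-major, one byte per pixel
};

// Stage 1: halve both dimensions by averaging each 2x2 block (rounded).
Plane Reduce2x2(const Plane& src)
{
    assert(src.width % 2 == 0 && src.height % 2 == 0);
    Plane dst{ src.width / 2, src.height / 2, {} };
    dst.data.resize(dst.width * dst.height);
    for (size_t y = 0; y < dst.height; ++y)
        for (size_t x = 0; x < dst.width; ++x)
        {
            const uint8_t* p = src.data.data() + 2 * y * src.width + 2 * x;
            dst.data[y * dst.width + x] = uint8_t((p[0] + p[1] + p[src.width] + p[src.width + 1] + 2) / 4);
        }
    return dst;
}

// Stage 2: integer-factor box (area) average with rounding.
Plane BoxAverage(const Plane& src, size_t factor)
{
    assert(factor > 0 && src.width % factor == 0 && src.height % factor == 0);
    Plane dst{ src.width / factor, src.height / factor, {} };
    dst.data.resize(dst.width * dst.height);
    const uint32_t area = uint32_t(factor * factor);
    for (size_t y = 0; y < dst.height; ++y)
        for (size_t x = 0; x < dst.width; ++x)
        {
            uint32_t sum = 0;
            for (size_t j = 0; j < factor; ++j)
                for (size_t i = 0; i < factor; ++i)
                    sum += src.data[(y * factor + j) * src.width + x * factor + i];
            dst.data[y * dst.width + x] = uint8_t((sum + area / 2) / area);
        }
    return dst;
}

// Combined: cheap 2x2 reduction first, then area averaging of the remainder.
Plane AreaReduced2x2(const Plane& src, size_t factor)
{
    assert(factor % 2 == 0);
    return BoxAverage(Reduce2x2(src), factor / 2);
}
```

Because each stage rounds separately, the output can differ by a small amount from a single-pass area average.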

mikeversteeg commented 2 years ago

Interesting idea. Maybe I should test the algorithm first to see what the result looks like on typical video streams?

ermig1979 commented 2 years ago

Yes. I am implementing the Base version of the algorithm now. It makes sense to check it before I start on optimizations.

ermig1979 commented 2 years ago

The Base implementation of the SimdResizeMethodAreaReduced2x2 resize method in SimdResizer is done. Could you check the quality of the resizing algorithm?

mikeversteeg commented 2 years ago

Will do!

mikeversteeg commented 2 years ago

I'm afraid it doesn't work, or do I need to change something else besides the resize method? Depending on the size reduction there is no image or it has a green (Y) haze over it..

mikeversteeg commented 2 years ago

Is SimdLib threadsafe?

mikeversteeg commented 2 years ago

It's really weird. If I start small and increase the target size slowly I see Y-only, some sizes also have U&V, some ranges show gradients and above 50% the entire target turns green (as in YUV=0). I initially thought it may have been a thread safety issue but excluded that.

ermig1979 commented 2 years ago

The resizer uses a single-threaded algorithm and does not use any synchronization primitives internally. It assumes that the input and output data are not changed by another thread. Otherwise you have to protect the data yourself, outside the resizer.

mikeversteeg commented 2 years ago

So at exactly 50% the resizer works and at 25% it works reasonably (but no colours) which enabled me to test your algorithm. I have to say I am having a very hard time finding any differences? If this is at almost half the CPU then good job! Once the bug is fixed I will run more tests with other images and smaller sizes.

ermig1979 commented 2 years ago

I fixed some bugs. Could you check whether the error has disappeared?

mikeversteeg commented 2 years ago

I'm trying but cannot find the new version, just v4.9.112 of this morning.

mikeversteeg commented 2 years ago

How do I download the zip?

ermig1979 commented 2 years ago

No. The updates were made after the release.

mikeversteeg commented 2 years ago

If I click on the green Code button I can only download the zip file of the master, which is 4.9.112 of this morning (all files timestamped 11:54 local). I do not know how to download the update.

mikeversteeg commented 2 years ago

I decided today to learn a bit about how GitHub works (never used it) and came to the conclusion that downloading the master is correct. So I downloaded it again, ignored the source file timestamps and built the solution. Indeed this is the new version with bug fixes; I guess the timestamps are not to be relied upon. I am pleased to tell you that the bugs have been fixed and that SimdResizeMethodAreaFast is visually on par with the old SimdResizeMethodArea and currently at roughly the same CPU usage. So this looks very promising! Great work!

mikeversteeg commented 2 years ago

Haven't heard from you in a while.. Just to verify: you are not waiting for me to do something?

ermig1979 commented 2 years ago

Hi! I have been busy with another project for the last few weeks. As far as I remember, I finished optimization of the AreaFast method for SSE4.1 and AVX2, and partially for AVX-512. There is a ResizeYuv420pSpecialTest test to verify the quality of the different resize methods. There is no significant difference between the Area and AreaFast methods.

mikeversteeg commented 2 years ago

The version I tested already had SIMD optimisations? Mmm.. As I reported there was no improvement in CPU usage :/

ermig1979 commented 2 years ago

I added the optimizations on 6 April, so that was after your check.

mikeversteeg commented 2 years ago

Great to hear you released a new version, I will test immediately!

mikeversteeg commented 2 years ago

I'm afraid the news is not good. There is no improvement in CPU usage compared to the previous release; CPU usage for SimdResizeMethodArea and SimdResizeMethodAreaFast is the same. What is odd is that the version number reported is now back to 4.9.112, while the previous release was 4.9.113. I downloaded and built both the master from GitHub and the master from your site.

ermig1979 commented 2 years ago

Thanks for the response. I will try to fix the problem.

ermig1979 commented 2 years ago

I made some fixes. AreaFast (ArF-b) now looks about 1.5 times faster than the Area (ArO-b) method (see the API column):

-------------------------------------------------------------------------------------------------------------------------------------
| Function                                |   API  Base  Sse2 Sse41   Avx  Avx2 | Bs/S2 Bs/S4 Bs/A1 Bs/A2 | Bs/S2 S2/S4 S4/A1 A1/A2 |
-------------------------------------------------------------------------------------------------------------------------------------
| Common, ms                              | 0.527 1.494 0.997 0.599 0.596 0.533 |  1.50  2.49  2.50  2.80 |  1.50  1.66  1.00  1.12 |
-------------------------------------------------------------------------------------------------------------------------------------
| ResizerInit[1:1919x1081->299x168:ArF-b] | 0.221 0.497 0.504 0.249 0.256 0.223 |  0.99  2.00  1.94  2.23 |  0.99  2.03  0.97  1.15 |
| ResizerInit[1:1919x1081->299x168:ArO-b] | 0.333 0.993 0.391 0.372 0.363 0.333 |  2.54  2.67  2.73  2.98 |  2.54  1.05  1.02  1.09 |
| ResizerInit[1:1920x1080->240x135:ArF-b] | 0.163 0.362 0.374 0.201 0.199 0.164 |  0.97  1.80  1.82  2.21 |  0.97  1.86  1.01  1.22 |
| ResizerInit[1:1920x1080->240x135:ArO-b] | 0.228 0.787 0.303 0.259 0.254 0.227 |  2.60  3.04  3.10  3.47 |  2.60  1.17  1.02  1.12 |
| ResizerInit[1:1920x1080->299x168:ArF-b] | 0.217 0.483 0.494 0.251 0.248 0.215 |  0.98  1.92  1.95  2.24 |  0.98  1.97  1.01  1.15 |
| ResizerInit[1:1920x1080->299x168:ArO-b] | 0.310 0.969 0.391 0.372 0.363 0.323 |  2.48  2.61  2.67  3.00 |  2.48  1.05  1.02  1.12 |
| ResizerInit[1:1920x1080->480x270:ArF-b] | 0.398 0.555 0.563 0.384 0.384 0.395 |  0.98  1.45  1.45  1.41 |  0.98  1.47  1.00  0.97 |
| ResizerInit[1:1920x1080->480x270:ArO-b] | 0.526 1.039 0.567 0.556 0.543 0.524 |  1.83  1.87  1.91  1.98 |  1.83  1.02  1.02  1.04 |
| ResizerInit[2:1919x1081->299x168:ArF-b] | 0.428 1.213 1.204 0.501 0.496 0.423 |  1.01  2.42  2.45  2.87 |  1.01  2.41  1.01  1.17 |
| ResizerInit[2:1919x1081->299x168:ArO-b] | 0.623 1.833 0.669 0.606 0.601 0.638 |  2.74  3.02  3.05  2.87 |  2.74  1.10  1.01  0.94 |
| ResizerInit[2:1920x1080->240x135:ArF-b] | 0.302 0.915 0.900 0.351 0.359 0.302 |  1.02  2.60  2.55  3.03 |  1.02  2.56  0.98  1.19 |
| ResizerInit[2:1920x1080->240x135:ArO-b] | 0.401 1.512 0.557 0.451 0.450 0.395 |  2.71  3.35  3.36  3.83 |  2.71  1.24  1.00  1.14 |
| ResizerInit[2:1920x1080->299x168:ArF-b] | 0.423 1.200 1.183 0.495 0.488 0.432 |  1.01  2.42  2.46  2.78 |  1.01  2.39  1.01  1.13 |
| ResizerInit[2:1920x1080->299x168:ArO-b] | 0.617 1.811 0.677 0.601 0.607 0.625 |  2.68  3.01  2.99  2.90 |  2.68  1.13  0.99  0.97 |
| ResizerInit[2:1920x1080->480x270:ArF-b] | 0.717 1.097 1.106 0.777 0.771 0.723 |  0.99  1.41  1.42  1.52 |  0.99  1.42  1.01  1.07 |
| ResizerInit[2:1920x1080->480x270:ArO-b] | 0.929 2.027 1.111 0.951 0.940 0.919 |  1.82  2.13  2.16  2.20 |  1.82  1.17  1.01  1.02 |
| ResizerInit[3:1919x1081->299x168:ArF-b] | 0.613 1.977 1.970 0.727 0.742 0.610 |  1.00  2.72  2.66  3.24 |  1.00  2.71  0.98  1.22 |
| ResizerInit[3:1919x1081->299x168:ArO-b] | 0.683 2.757 1.072 0.863 0.869 0.695 |  2.57  3.19  3.17  3.96 |  2.57  1.24  0.99  1.25 |
| ResizerInit[3:1920x1080->240x135:ArF-b] | 0.462 1.576 1.556 0.576 0.569 0.459 |  1.01  2.74  2.77  3.43 |  1.01  2.70  1.01  1.24 |
| ResizerInit[3:1920x1080->240x135:ArO-b] | 0.592 2.316 0.957 0.653 0.660 0.610 |  2.42  3.54  3.51  3.80 |  2.42  1.47  0.99  1.08 |
| ResizerInit[3:1920x1080->299x168:ArF-b] | 0.595 1.944 1.971 0.717 0.726 0.594 |  0.99  2.71  2.68  3.27 |  0.99  2.75  0.99  1.22 |
| ResizerInit[3:1920x1080->299x168:ArO-b] | 0.691 2.710 1.094 0.875 0.901 0.701 |  2.48  3.10  3.01  3.86 |  2.48  1.25  0.97  1.28 |
| ResizerInit[3:1920x1080->480x270:ArF-b] | 0.755 1.795 1.880 0.836 0.845 0.746 |  0.95  2.15  2.12  2.41 |  0.95  2.25  0.99  1.13 |
| ResizerInit[3:1920x1080->480x270:ArO-b] | 1.025 3.135 1.864 1.026 1.023 1.026 |  1.68  3.05  3.07  3.06 |  1.68  1.82  1.00  1.00 |
| ResizerInit[4:1919x1081->299x168:ArF-b] | 0.650 1.796 1.788 0.796 0.818 0.649 |  1.00  2.26  2.20  2.77 |  1.00  2.25  0.97  1.26 |
| ResizerInit[4:1919x1081->299x168:ArO-b] | 0.937 3.499 1.609 1.062 1.046 1.004 |  2.17  3.30  3.34  3.49 |  2.17  1.52  1.01  1.04 |
| ResizerInit[4:1920x1080->240x135:ArF-b] | 0.513 1.365 1.361 0.607 0.634 0.514 |  1.00  2.25  2.15  2.66 |  1.00  2.24  0.96  1.23 |
| ResizerInit[4:1920x1080->240x135:ArO-b] | 0.810 3.053 1.384 0.869 0.863 0.869 |  2.21  3.51  3.54  3.51 |  2.21  1.59  1.01  0.99 |
| ResizerInit[4:1920x1080->299x168:ArF-b] | 0.646 1.764 1.754 0.803 0.818 0.647 |  1.01  2.20  2.16  2.73 |  1.01  2.18  0.98  1.26 |
| ResizerInit[4:1920x1080->299x168:ArO-b] | 0.936 3.462 1.573 1.083 1.211 1.009 |  2.20  3.20  2.86  3.43 |  2.20  1.45  0.89  1.20 |
| ResizerInit[4:1920x1080->480x270:ArF-b] | 0.796 1.938 1.953 0.918 0.906 0.783 |  0.99  2.11  2.14  2.47 |  0.99  2.13  1.01  1.16 |
| ResizerInit[4:1920x1080->480x270:ArO-b] | 1.325 3.659 2.421 1.659 1.249 1.439 |  1.51  2.21  2.93  2.54 |  1.51  1.46  1.33  0.87 |
-------------------------------------------------------------------------------------------------------------------------------------

mikeversteeg commented 2 years ago

Thanks, downloading..

mikeversteeg commented 2 years ago

Unfortunately there is still no CPU or speed difference between SimdResizeMethodArea and SimdResizeMethodAreaFast. The version number reported is still 4.9.112.

PS: I do not understand the table above: you only test SimdResizerInit, not SimdResizerRun?

ermig1979 commented 2 years ago

It's OK. The test does run SimdResizerRun. The test framework used in Simd is designed to test individual functions. Using the resizer consists of calling 3 functions (Init, Resize and Release), so the framework labels the test with only the first of them. For a better understanding see ResizerAutoTest.

mikeversteeg commented 2 years ago

I am not sure whether I am using the right version (as you do not increment the build number and timestamps are meaningless) or your code just doesn't run efficiently on my PC, so I figured I'd try your test. But I cannot find it. I ran Test.exe and it does not run anything with resize in the name either. It does not complete though, as it runs into an error.

[000] Info: DetectionHaarDetect32fpAutoTest is started :
[000] Error: Can't load cascade '../../data/cascade/haar_face_0.xml' !
[000] Error: DetectionHaarDetect32fpAutoTest has errors. TEST EXECUTION IS TERMINATED!

ermig1979 commented 2 years ago

Try to use filter: ./Test -fi=Resizer

mikeversteeg commented 2 years ago

Your test shows the same improvement here, yet in my app it does not offer any improvement. I wonder why that is, even more so because the other filters do offer a major improvement (at lesser quality).. PS: SSE2 is weird?

-------------------------------------------------------------------------------------------------------------------------------------
| Function                                |   API  Base  Sse2 Sse41   Avx  Avx2 | Bs/S2 Bs/S4 Bs/A1 Bs/A2 | Bs/S2 S2/S4 S4/A1 A1/A2 |
-------------------------------------------------------------------------------------------------------------------------------------
| Common, ms                              | 0.431 1.552 0.893 0.537 0.552 0.452 |  1.74  2.89  2.81  3.43 |  1.74  1.66  0.97  1.22 |
-------------------------------------------------------------------------------------------------------------------------------------
| ResizerInit[1:1919x1081->299x168:ArF-b] | 0.198 0.625 0.633 0.242 0.257 0.192 |  0.99  2.58  2.43  3.26 |  0.99  2.61  0.94  1.34 |
| ResizerInit[1:1919x1081->299x168:ArO-b] | 0.299 1.323 0.349 0.345 0.357 0.327 |  3.79  3.84  3.70  4.05 |  3.79  1.01  0.96  1.09 |
| ResizerInit[1:1920x1080->240x135:ArF-b] | 0.146 0.503 0.476 0.179 0.185 0.153 |  1.06  2.81  2.71  3.28 |  1.06  2.66  0.96  1.21 |
| ResizerInit[1:1920x1080->240x135:ArO-b] | 0.236 1.091 0.314 0.276 0.325 0.229 |  3.47  3.95  3.36  4.76 |  3.47  1.14  0.85  1.42 |
| ResizerInit[1:1920x1080->299x168:ArF-b] | 0.186 0.609 0.683 0.228 0.224 0.181 |  0.89  2.67  2.72  3.36 |  0.89  3.00  1.02  1.24 |
| ResizerInit[1:1920x1080->299x168:ArO-b] | 0.307 1.445 0.357 0.346 0.366 0.327 |  4.04  4.18  3.95  4.42 |  4.04  1.03  0.94  1.12 |
| ResizerInit[1:1920x1080->480x270:ArF-b] | 0.303 0.651 0.672 0.344 0.359 0.300 |  0.97  1.89  1.81  2.17 |  0.97  1.95  0.96  1.19 |
| ResizerInit[1:1920x1080->480x270:ArO-b] | 0.457 1.372 0.579 0.463 0.503 0.470 |  2.37  2.97  2.73  2.92 |  2.37  1.25  0.92  1.07 |
| ResizerInit[2:1919x1081->299x168:ArF-b] | 0.347 1.050 1.066 0.436 0.477 0.369 |  0.98  2.41  2.20  2.85 |  0.98  2.44  0.91  1.29 |
| ResizerInit[2:1919x1081->299x168:ArO-b] | 0.508 1.976 0.645 0.661 0.625 0.540 |  3.07  2.99  3.16  3.66 |  3.07  0.97  1.06  1.16 |
| ResizerInit[2:1920x1080->240x135:ArF-b] | 0.247 0.866 0.809 0.301 0.319 0.257 |  1.07  2.87  2.72  3.37 |  1.07  2.68  0.95  1.24 |
| ResizerInit[2:1920x1080->240x135:ArO-b] | 0.376 1.819 0.504 0.439 0.466 0.391 |  3.61  4.14  3.90  4.65 |  3.61  1.15  0.94  1.19 |
| ResizerInit[2:1920x1080->299x168:ArF-b] | 0.318 1.042 1.053 0.417 0.416 0.488 |  0.99  2.50  2.51  2.14 |  0.99  2.52  1.00  0.85 |
| ResizerInit[2:1920x1080->299x168:ArO-b] | 0.473 1.973 0.598 0.593 0.611 0.488 |  3.30  3.32  3.23  4.04 |  3.30  1.01  0.97  1.25 |
| ResizerInit[2:1920x1080->480x270:ArF-b] | 0.480 1.026 1.029 0.533 0.572 0.486 |  1.00  1.92  1.79  2.11 |  1.00  1.93  0.93  1.18 |
| ResizerInit[2:1920x1080->480x270:ArO-b] | 0.728 1.953 0.844 0.734 0.799 0.739 |  2.31  2.66  2.45  2.64 |  2.31  1.15  0.92  1.08 |
| ResizerInit[3:1919x1081->299x168:ArF-b] | 0.523 1.670 1.517 0.656 0.659 0.534 |  1.10  2.55  2.54  3.13 |  1.10  2.31  1.00  1.23 |
| ResizerInit[3:1919x1081->299x168:ArO-b] | 0.565 2.579 0.920 0.710 0.869 0.609 |  2.80  3.63  2.97  4.23 |  2.80  1.30  0.82  1.43 |
| ResizerInit[3:1920x1080->240x135:ArF-b] | 0.373 1.184 1.169 0.445 0.490 0.455 |  1.01  2.66  2.42  2.60 |  1.01  2.63  0.91  1.08 |
| ResizerInit[3:1920x1080->240x135:ArO-b] | 0.494 2.085 0.715 0.714 0.683 0.430 |  2.92  2.92  3.06  4.85 |  2.92  1.00  1.05  1.59 |
| ResizerInit[3:1920x1080->299x168:ArF-b] | 0.528 2.015 1.525 0.589 0.631 0.589 |  1.32  3.42  3.19  3.42 |  1.32  2.59  0.93  1.07 |
| ResizerInit[3:1920x1080->299x168:ArO-b] | 0.546 2.500 0.805 0.730 0.831 0.534 |  3.10  3.42  3.01  4.68 |  3.10  1.10  0.88  1.56 |
| ResizerInit[3:1920x1080->480x270:ArF-b] | 0.623 1.530 1.436 0.696 0.736 0.731 |  1.07  2.20  2.08  2.09 |  1.07  2.06  0.94  1.01 |
| ResizerInit[3:1920x1080->480x270:ArO-b] | 0.813 2.506 1.177 1.042 0.831 0.803 |  2.13  2.41  3.02  3.12 |  2.13  1.13  1.25  1.03 |
| ResizerInit[4:1919x1081->299x168:ArF-b] | 0.575 2.094 2.073 0.725 0.767 0.570 |  1.01  2.89  2.73  3.68 |  1.01  2.86  0.95  1.35 |
| ResizerInit[4:1919x1081->299x168:ArO-b] | 0.705 3.308 1.146 1.133 1.136 0.673 |  2.89  2.92  2.91  4.92 |  2.89  1.01  1.00  1.69 |
| ResizerInit[4:1920x1080->240x135:ArF-b] | 0.381 1.541 1.592 0.527 0.532 0.528 |  0.97  2.92  2.90  2.92 |  0.97  3.02  0.99  1.01 |
| ResizerInit[4:1920x1080->240x135:ArO-b] | 0.612 2.758 0.974 0.729 0.869 0.574 |  2.83  3.78  3.17  4.81 |  2.83  1.34  0.84  1.51 |
| ResizerInit[4:1920x1080->299x168:ArF-b] | 0.502 2.029 1.953 0.822 0.677 0.478 |  1.04  2.47  3.00  4.24 |  1.04  2.38  1.21  1.42 |
| ResizerInit[4:1920x1080->299x168:ArO-b] | 0.682 3.315 1.090 0.964 1.055 0.620 |  3.04  3.44  3.14  5.34 |  3.04  1.13  0.91  1.70 |
| ResizerInit[4:1920x1080->480x270:ArF-b] | 0.582 1.899 1.890 0.848 0.757 0.778 |  1.00  2.24  2.51  2.44 |  1.00  2.23  1.12  0.97 |
| ResizerInit[4:1920x1080->480x270:ArO-b] | 0.964 3.457 1.578 1.204 0.989 0.943 |  2.19  2.87  3.50  3.67 |  2.19  1.31  1.22  1.05 |
-------------------------------------------------------------------------------------------------------------------------------------

ermig1979 commented 2 years ago

The SSE2 optimizations are only for very old machines. That extension has many restrictions, so the AreaFast method can only be optimized using SSE4.1 and higher.

ermig1979 commented 2 years ago

One of the main reasons why you don't see any gain might be that your task is limited by memory throughput.

mikeversteeg commented 2 years ago

I do not understand. How does this explain all other filters are much faster?

ermig1979 commented 2 years ago

It's my assumption.

mikeversteeg commented 2 years ago

With all due respect, I would argue your assumption is flawed. I see no reason why the memory bandwidth required for SimdResizeMethodArea(Fast) is higher than for other filters. Also my timing graphs show there is no extra wait; in fact there is still plenty of time before the next frame requires processing. Despite my reservations I did repeat my tests using only 25% of the original resizing tasks, and there still was no difference in CPU usage between SimdResizeMethodAreaFast and SimdResizeMethodArea.
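
One way to put rough numbers on the bandwidth question is to count how many source samples each kind of filter has to touch per output frame. The sketch below is only an estimate under stated assumptions: `SamplesRead` and `EstimateSamplesRead` are hypothetical names, it covers one 8-bit plane, it assumes nearest reads 1 sample and bilinear at most a 2x2 neighbourhood per output pixel while an area filter reads every source sample once, and it ignores cache-line granularity, so it does not measure real memory traffic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Source samples contributing to one output frame of a single 8-bit plane.
// These are sample counts under simplifying assumptions, not measured traffic.
struct SamplesRead
{
    size_t nearest, bilinear, area;
};

SamplesRead EstimateSamplesRead(size_t srcW, size_t srcH, size_t dstW, size_t dstH)
{
    assert(dstW > 0 && dstH > 0 && dstW <= srcW && dstH <= srcH);
    SamplesRead s;
    s.nearest = dstW * dstH;                               // one sample per output pixel
    s.bilinear = std::min(4 * dstW * dstH, srcW * srcH);   // at most a 2x2 neighbourhood each
    s.area = srcW * srcH;                                  // every source sample is used once
    return s;
}
```

For the 1920x1080 to 480x270 case from the tables, `EstimateSamplesRead(1920, 1080, 480, 270)` gives 129600, 518400 and 2073600 samples. Whether that difference matters in a given application depends on caching and on what else the threads are doing, which this estimate does not capture.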

ermig1979 commented 2 years ago

It would be great if you could post an example here that uses SimdResizer and gives the same performance in both cases (Area and AreaFast).
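
A minimal timing harness for such an example might look like the sketch below. It is deliberately independent of Simd: `MeanMilliseconds` is a hypothetical helper that times any callable, and the assumption is that the caller wraps one SimdResizerRun call per method in the callable, on resizers created beforehand so that initialization is not timed.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>

// Returns the mean wall-clock time of one call to run(), in milliseconds.
double MeanMilliseconds(const std::function<void()>& run, size_t iterations, size_t warmup = 3)
{
    assert(iterations > 0);
    for (size_t i = 0; i < warmup; ++i)
        run(); // untimed calls to warm caches
    const auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iterations; ++i)
        run();
    const std::chrono::duration<double, std::milli> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count() / double(iterations);
}
```

Timing the same source and destination buffers once with an Area resizer and once with an AreaFast resizer, in a single thread, would show whether the difference seen in the test tables reproduces outside the test framework.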

mikeversteeg commented 2 years ago

Yes, but I doubt that will be easy. I am testing under real life conditions using multiple resizers in various sizes, various sources, various targets (some resize to a shared target), all running in their own threads. So far I've not been able to see a performance difference. But my app and your test are very different. It remains a mystery.