Asd-g / AviSynth-JincResize

Jinc (EWA Lanczos) resampler.
MIT License

Faster 2D convolution approaches #3

Open DTL2020 opened 3 years ago

DTL2020 commented 3 years ago

I am thinking of a different 2D convolution approach to significantly speed up this plugin.

One possible speed-up of the 2D convolution with typical 8-bit or even 10-bit unsigned integer input data: instead of computing kernel × input and adding it to the output sum, add a pre-multiplied LUT value to the output sum.

But I do not know whether, on today's CPUs, there is a significant difference between a Mul+Add operation and an Add alone. It looks like only implementation and testing will tell.

Because 2D convolution has to multiply each kernel sample with each input sample, and all possible 8-bit input samples form only 256 distinct values, we can build a pre-multiplied 256 × kernel_size LUT and simply index into it instead of multiplying. Even for a 10-tap 2D kernel this is a 20 × 20 × 4-byte-float × 256 ≈ 400 KB LUT, which caches well on most CPUs from the 2010s and later.

So the main computational line of the convolution (from the C routine), result += src_ptr[lx] * coeff_ptr[lx];, may be replaced with something like result += LUT[src_ptr[lx]]; // no multiplication - just a cache read and an addition

The LUT start pointer is valid for a whole kernel line, so it can be computed once per summation of a full kernel line when using SIMD/ASM processing.
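A minimal sketch of the idea, assuming 8-bit input and a square float kernel; the names (build_premultiplied_lut, add_sample) are illustrative only, not the plugin's actual API:

```cpp
#include <vector>
#include <cstdint>
#include <cstddef>

// Pre-multiply the kernel by every possible 8-bit sample value, so the per-sample
// work becomes pure additions of a pre-scaled kernel (no multiplications).
static std::vector<float> build_premultiplied_lut(const float* kernel, int kernel_size)
{
    const size_t area = static_cast<size_t>(kernel_size) * kernel_size;
    std::vector<float> lut(256 * area);
    for (int v = 0; v < 256; ++v)
        for (size_t i = 0; i < area; ++i)
            lut[v * area + i] = kernel[i] * static_cast<float>(v);
    return lut;
}

// For one input sample with value 'v' the LUT base pointer is computed once,
// and the whole pre-scaled kernel is added into the output window:
//   out += lut_for_v[i];  // no Mul - just a cache read and an Add
static void add_sample(float* out_window, int out_stride,
                       const float* lut, int kernel_size, uint8_t v)
{
    const float* lut_for_v = lut + static_cast<size_t>(v) * kernel_size * kernel_size;
    for (int ky = 0; ky < kernel_size; ++ky)
        for (int kx = 0; kx < kernel_size; ++kx)
            out_window[ky * out_stride + kx] += lut_for_v[ky * kernel_size + kx];
}
```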

Addendum:

I think there are two significantly different approaches to 2D convolution. They give the same output result but may differ greatly in speed on different platforms:

  1. Each output sample gathers a kernel-weighted sum over the input area covered by kernel_size (the filter/support size).
  2. Each input sample 'casts' (adds) the kernel, weighted by that input sample, into the output buffer.

Approach 1 needs significant memory-read traffic (to both the input buffer and the kernel buffer) and produces very little output write traffic (each output is written once, so it may even be uncached). The kernel buffer is read-only and can easily be shared between all cores in multi-core processing. The input buffer is also read-only, and its cached image may be shared by many cores processing neighbouring input samples. I still do not understand whether a LUT can be used in this approach.

Approach 2 produces very little read traffic from the input buffer but read-(mul+)add-write traffic to the output buffer. If this traffic caches well, the actual memory writes depend on the CPU's memory manager. This approach allows a LUT to be used, because the kernel buffer only needs to be weighted by a small number of possible input values. But for multi-core processing it mostly requires each core to work on sufficiently distant input and output regions, because the read-(mul+)add-write access to the output buffer may require many resources to keep the caches coherent between cores. This approach also allows zero input samples to be skipped with a simple compare-and-continue, because a zero input sample adds an all-zero kernel to the output buffer and does not change it.
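A minimal side-by-side sketch of the two loop structures (scalar C++, illustrative names, no edge handling), informally labelling approach 1 'gather' and approach 2 'scatter':

```cpp
// Approach 1 ("gather"): each output sample reads the whole kernel window once.
static float gather_one_output(const float* src, int src_stride,
                               const float* kernel, int kernel_size)
{
    float result = 0.0f;
    for (int ky = 0; ky < kernel_size; ++ky)
        for (int kx = 0; kx < kernel_size; ++kx)
            result += src[ky * src_stride + kx] * kernel[ky * kernel_size + kx];
    return result; // written to the output buffer exactly once
}

// Approach 2 ("scatter"): each input sample adds its weighted kernel into the output.
static void scatter_one_input(float* out, int out_stride,
                              const float* kernel, int kernel_size, float sample)
{
    if (sample == 0.0f)
        return; // a zero sample contributes nothing and can be skipped cheaply
    for (int ky = 0; ky < kernel_size; ++ky)
        for (int kx = 0; kx < kernel_size; ++kx)
            out[ky * out_stride + kx] += kernel[ky * kernel_size + kx] * sample;
}
```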

It may be good to test both approaches on today's hardware platforms to compare their processing speed.

As far as I can see, the C resampler subroutine uses approach 1.

Asd-g commented 3 years ago

Can you create a pull request?

DTL2020 commented 3 years ago

These are just ideas for the future, and they require a lot of coding (for me) and debugging to produce a version for arbitrary-ratio resizing as in the current JincResize project. I will try to do it later. First I need to test it in my small separate project with integer-ratio resizing (i.e. x2, x3, etc.). It may start as a separate branch of the processing code, i.e. the current resampler for any float resize ratio and a separate resampler for integer ratios.

Also, an idea for decreasing the memory requirement of the LUT: because the 2D kernel usually has two symmetry axes, vertical and horizontal, it is in theory possible to keep only about 1/4 of the kernel samples in memory and obtain the other 3/4 by mirroring vertically and horizontally. At least the row symmetry is easy to exploit: just read from the first row to the middle row of the kernel and then in reverse order. Using the column symmetry requires transposing/shuffling numbers in registers, I think, because reading memory in reverse order may be very slow; so it may only be applicable to kernels small enough to fit in registers, and it may even be slower than reading from main memory.
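A rough sketch of the row-symmetry half only, assuming an odd kernel_size and a buffer that stores just the top half plus the middle row (illustrative names, not the plugin's code):

```cpp
#include <cstddef>

// Only rows 0..kernel_size/2 are stored; lower rows are read by mirroring the row
// index, so the kernel's memory footprint is roughly halved.
static const float* kernel_row(const float* half_kernel, int kernel_size, int ky)
{
    const int mid = kernel_size / 2;                          // middle row (odd size assumed)
    const int mirrored = (ky <= mid) ? ky : kernel_size - 1 - ky;
    return half_kernel + static_cast<size_t>(mirrored) * kernel_size;
}
```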

DTL2020 commented 3 years ago

Here is the current work in progress on additive LUT-based 2D convolution with a Jinc (EWA Lanczos) kernel: https://github.com/DTL2020/2D_resampler . It also has an example mul+add convolution loop without a LUT. I am still weak with the AviSynth plugin interface for reading from and writing to AviSynth buffers, so it uses libtiff and TIFF files for input and output. It is only a short demo of the additive LUT 2D convolution approach and it does not process the buffer edges correctly, so the test images must be placed closer to the center.

It can now use a Jinc (EWA Lanczos) kernel and shows results close to this plugin's resizer. A comparison is attached: jinc_sinclin2_2d7

Test script:

[code]
LoadPlugin("ResampleMT.dll")
LoadPlugin("JincResize.dll")

function Ast2(clip c, int isize) { return Subtitle(c, "7", font="Arial", size=isize, x=10, y=20, halo_color=$FF000000, text_color=$00e0e0e0) }

BlankClip(100, 200, 180, "RGB24", 25, color=$00202020)
Animate(last, 0, 100, "Ast2", 45, 180)
AddBorders(150, 140, 100, 104, color=$00202020)
ConvertToYV12()
SinPowResizeMT(last, width/4, height/4, p=2.7)
return last

SincLin2ResizeMT(width*4, height*4, taps=20)
Jinc256Resize(width*4, height*4)
[/code]

DTL2020 commented 3 years ago

Well - finally a pull request is created. There are still many things to do:

1. Find out why OpenMP on the addition loop fails with the x64 build in VS2019. Tested with the Intel C++ compiler (Version 2021.1, Build 20201112_000000): the #pragma omp at line 889 of JincResize.cpp can be enabled and works OK, so maybe some VS2019 settings can be tweaked.
2. Improve the ASM summation functions for different SIMD instruction sets.
3. Make the kernel half-height to lower the memory requirement, so larger kernels fit in the CPU cache.
4. Maybe run-length encode the kernel lines, to possibly save about 27% of memory and ADD operations, because about 27% of the round kernel is zeroes that need not be kept in memory or added to the output sum.
5. Add a downscaling function.
6. Align kernel rows to a 32-byte boundary so aligned read instructions can be used for slightly faster reading.

Asd-g commented 3 years ago

The pull request was merged into branch master-1. I made some changes so building with OpenMP is now possible, and the leftover floats (ASM) are processed too.

Quick test with this image - https://i.imgaa.com/2021/01/04/5ff26063d2a7f928883289.png

  1. jincresize(width*10, height*10, tap=10, opt=0, threads=1):
    • v2.0.1 - 0.707 fps, long initial time, 3 GB RAM;
    • new - 0.193 fps, 280 MB RAM.
  2. jincresize(width*10, height*10, opt=0, threads=1):
    • v2.0.1 - 3.47 fps, 260 MB RAM;
    • new - 5.29 fps, 90 MB RAM.
  3. jincresize(width*10, height*10, tap=10, opt=2, threads=0):
    • v2.0.1 - 22.45 fps, long initial time, 3 GB RAM;
    • new - 2.27 fps, 205 MB RAM.
  4. jincresize(width*10, height*10, opt=2, threads=0):
    • v2.0.1 - 182.65 fps, 260 MB;
    • new - 52.17 fps, 90 MB.

DTL2020 commented 3 years ago

It is sad that even new C compilers from 2020 do not seem smart enough to turn such a simple addition task into good ASM. The one-line intrinsic also appears to be compiled into processing with only 1 SIMD register, so about 7 of 8 (or maybe 15 of 16 on some platforms) SIMD registers are free and unused. I am thinking of accumulating more sums for an output row in SIMD registers, from neighbouring input samples, before writing to the output buffer. The shift between the start points where kernel rows are added into the output row is iMul float32 values, and with small enough practical multipliers (upsampling HD to UHD = 2, or SD to HD/UHD = 2 to 4) that shift is 8 to 16 bytes. So we can decrease the number of reads and writes to the output buffer. It should work fast enough as long as the kernel LUT array fits in part of the CPU cache. For practical upsampling I think typically iMul < iTaps; it looks like iMul is 2..4 (maybe 8) and taps are 4 to 10 and more. I will try to make a test ASM subroutine that adds the kernels of 2 input samples to a row in a single processing loop with iMul=2 and look at its performance.

DTL2020 commented 3 years ago

Some work in progress, with a working demo for a fixed kernel size: https://github.com/DTL2020/AviSynth-JincResize/tree/master-1 . The C function also works, but without significant improvement; the AVX2 version is the main one now. Even the current implementation of 2-sample convolution in one horizontal pass is not fully debugged. For a 2x upsize of a 1920x1080 frame with taps=8, or a 4x upsize with taps=4, it shows a performance improvement of about 50% over the old large-RAM kernel processing.

Current investigation results: the attempt to decrease the work units per thread does not work as expected; the fastest is MT on different, far-apart parts of the input and output buffers. Reading the 256-image LUT works a bit slower than multiplying 1 kernel by the input sample at processing time. The biggest performance boost comes from increasing the number of input samples processed without writing to memory, i.e. within the available SIMD registers. Simply 'unrolling' the summation loop by loading all available SIMD registers adds very little, about 5%, but going from 1 to 2 input samples adds about +50%. Increasing the number of input samples further gives diminishing but real gains: going from 2 to 3 samples adds about 20..25%. The current version shows how to use 2 samples. I think using AVX512 registers will also give a good performance boost, but I still have no access to such a CPU to check the design, and without debugging it may still contain errors.

The complex convolution work of loading most of the available SIMD registers and processing 3 or more samples per pass is much more complex than the single line of SIMD intrinsics in previous versions. An attempt to reduce processing of the round kernel adds a bit to the C subroutine, but it looks bad for the SIMD subroutines, because any intervention in the highly loaded SIMD processing loop greatly decreases performance. This is also a hard task for the coder, because it looks like the fastest possible SIMD subroutines can only be made for fixed values of iKernelSize. Maybe an approach that processes with a decreasing workload per loop will work acceptably: first process the largest work unit loadable into the available SIMD registers, then 1/2, then 1/4, and so on. The more we can load into large AVX512 registers, the greater the difference from the 'last' part of the workload, which is currently processed with non-SIMD code.

I tried to make a 'function call' that looks like an inline, with different work units processed by SIMD functions - the void (JincResize::*AVX2Row)(int64_t k_col_x, float* pfProc, float* pfCurrKernel_pos, float* pfSample); function. But it looks like it is not inlined but called as a function, and that greatly degrades performance. So for today only ASM hardcoded into the main processing loop works fast enough. I will try to fix the AVX2 and other ASM subroutines in the next releases so they can work with different kernel sizes. Because the current work in progress works with only very limited combinations of iMul and iTaps, I think it is a bit early for a pull request. If anyone can help with debugging and improving the ASM SIMD functions, I can make a pull request if needed.

As for intrinsics vs hand-written separate ASM functions: I think the C-source intrinsics will be faster, because the optimizing compiler can do a good job of arranging the resulting CPU instructions, though writing intrinsics may be harder than pure ASM text. Using intrinsics in C source may also allow faster loops without call/ret interaction with an external subroutine, and without putting a larger part of the code into an external ASM function.

Addendum: I found the FMA instructions, and they are exactly for this task. Will test them too.
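A minimal illustration of the kind of inner loop meant here, assuming AVX2+FMA and illustrative names; it keeps several accumulators for one output row in ymm registers and uses _mm256_fmadd_ps instead of a separate mul + add:

```cpp
#include <immintrin.h>

// Add one kernel row (kernel_size floats, scaled by 'sample') into an output row kept
// entirely in ymm accumulators; nothing is written to memory inside the loop.
// Sketch only: assumes kernel_size is a multiple of 8 and small enough that acc[]
// fits in the available ymm registers.
static void fma_kernel_row(__m256 acc[], const float* kernel_row,
                           int kernel_size, float sample)
{
    const __m256 s = _mm256_set1_ps(sample);
    for (int i = 0; i < kernel_size / 8; ++i)
    {
        const __m256 k = _mm256_loadu_ps(kernel_row + i * 8);
        acc[i] = _mm256_fmadd_ps(k, s, acc[i]);   // acc += kernel * sample in one instruction
    }
}
```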

DTL2020 commented 3 years ago

I have uploaded some more work in progress to the master-1 branch. Tested the FMA instruction with a memory operand. It looks like, to get the highest possible GFLOPS from the CPU, we need FMA with register arguments only, because even an L1 data access takes about 5 cycles, while the FMA engines (2 per typical core in current CPUs) can produce 2x256-bit FMA results every cycle. So in theory a typical 3 GHz CPU with 16 FMA ops per cycle and 4 cores can provide 192 G FMA ops/s. For a 4x upsize with 4 taps (a 32x32 float kernel) and a 1920x1080 YV12 frame we need about 3 G FMA ops per frame, so a ~190 G FMA ops CPU should run at up to about 63 fps; with a /2 loss for multithreading and a /2 loss for other processing, about 15 fps. As for AVX512, I am not sure it can provide 2x512-bit FMA per cycle; it may be limited to 2x256 or 1x512.

To process long kernels and more than 1 input sample between registers, we need some way of shifting data through a number of SIMD registers by a number of floats, with the shifted data entering and leaving at the register edges. I have not yet found how to do this fast enough. One easy way is a memory-addressed operand with 1-byte granularity for any iMul value, but it touches the L1 cache frequently, which seems to lower performance. The only integer 'shift' that is simply register renaming is iMul=8, but I think 8x upsizing is not very practical.

There is another approach: 'shift' via a store-load operation. Because it can be performed less frequently than the input-sample processing, it looks like it adds significantly to performance. The Intel C compiler also seems to understand it and applies some further optimisation; unfortunately the MSVC compiler does not. This can be seen in the compilers' ASM output too. Maybe, by looking at the Intel C compiler output, we can move some more intrinsics into the C code and make this technique work for the MSVC compiler too, and maybe even remove some intermediate store instructions. In the current sources (master-1 branch of the fork) the function JincResize::KernelRow_avx2_mul4_taps4 shows this approach. It processes 8 input samples per loop pass: 4 odd samples with an 8-step ymm register renaming, then a store(addr) / load(addr + 4 floats), and the 4 even samples. Its output math is not yet fully debugged, but it shows a 4x or more speed-up over the 'large-kernel' resampler from resize_plane_avx2.cpp when processing images around 1920x1080 or 960x540 (4x upsize with 4 taps). I will try to debug it and get it closer to a pull request later.

Since the highest-performance subroutines apparently can only be made for an exact combination of upsize ratio and taps, I am thinking of simply adding some named resize functions like 'JincUpsize64_4x()'. Using the 512-bit AVX512 zmm registers will allow larger kernels and/or more samples per pass. A current annoyance: the Intel C compiler builds a slower resize_plane_avx2.cpp binary but a much more optimized KernelRow_avx2.cpp binary, and MSVC is the opposite. So the solution may be to build .lib files with different compilers and link them, though with OpenMP that may cause additional trouble. Also, using FMA instead of add+mul instructions seems to cause some bug with OpenMP at the borders of the processed parts of the buffer, where the outputs of the threads overlap; I will try to look at it later. It is very strange.

DTL2020 commented 3 years ago

Finally a new pull request is available. It compiles well with MSVS2019, and the Intel C Compiler produces a binary of comparable speed. There was a significant error in the operand order of the FMA instructions in the previous commit, which apparently caused the large difference between the MSVC and Intel C compilers. It has one main, more optimized AVX2+FMA subroutine for mul=4, taps=4. In tests with 960x540 and 1920x1080 YV12 inputs it is about 2 to 4 times faster than the old large-kernel approach.

Unfortunately the AVX2 instructions do not provide an easy way to do an integer shift between ymm registers for mul=2, 3, 5 and other values fast enough, the way _mm256_permute2f128_ps can; or it still needs to be tested in the future. AVX512 is much better here, even with just the most common AVX512F set, and many more well-performing mul+taps combinations could be made with AVX512 in the future. I do not have such a CPU for now but will try to upgrade. In the next versions I will try to add a mul=2 version with SSE instructions, because shufps allows exchanging between halves of different registers. As testing shows, with high mul values like 8 and above it works slower than the old large-kernel variant, so it is worth making new subroutines for small enough mul values, like 2 up to maybe 6; maybe AVX512 will give better performance in the future.

It also contains a subroutine that is not yet fully debugged, KernelRow_avx2_mul4_taps4_fr: an attempt to walk the convolution over the full output row, which can decrease DRAM page switching and maybe the store-load operations. I will try to finish it later to test the speed difference. There are also attempts at an approach using FMA with memory operands - the function KernelRow_avx2_mul2_taps8. It requires a separately prepared kernel buffer with padded rows to allow reading a bit ahead, or maybe shifted kernel buffers loaded with +stride for each new input sample. I think it will be slower than FMA between SIMD registers, but it allows more mul+taps combinations on AVX+FMA processors without AVX512. I will try to test it in the future too.

Also, as the KernelRow processing has become fast enough, I see that the simple conversion from float32 to uint8 output has become very slow, so it is now also done with AVX2; it still needs to be done for SSE and AVX512 in future versions. The main processing can easily be adapted to uint16 and especially float input/output, because the internal processing is in float32, and float32 in/out may even be faster. With the old large-kernel approach I see that enabling float32 in/out slows things down a bit for an unknown reason, even though it decreases the number of conversion operations; maybe accessing 4x larger memory for float32 is that slow.
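For the float32-to-uint8 output conversion mentioned above, a rough AVX2 sketch (not the plugin's actual routine, and assuming values are already roughly in the 0..255 range); the packing instructions work per 128-bit lane, so quadword permutes restore linear order:

```cpp
#include <immintrin.h>
#include <cstdint>

// Convert 16 consecutive floats to 16 uint8 values with unsigned saturation.
static void store_16_u8(const float* src, uint8_t* dst)
{
    __m256i ia = _mm256_cvtps_epi32(_mm256_loadu_ps(src));       // floats 0..7  -> int32
    __m256i ib = _mm256_cvtps_epi32(_mm256_loadu_ps(src + 8));   // floats 8..15 -> int32
    __m256i w  = _mm256_packus_epi32(ia, ib);                    // -> uint16, per-lane order
    w = _mm256_permute4x64_epi64(w, _MM_SHUFFLE(3, 1, 2, 0));    // fix lane interleaving
    __m256i b  = _mm256_packus_epi16(w, w);                      // -> uint8, per-lane again
    b = _mm256_permute4x64_epi64(b, _MM_SHUFFLE(3, 1, 2, 0));    // linear order in low 128 bits
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm256_castsi256_si128(b));
}
```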

DTL2020 commented 3 years ago

A more strategic idea: currently we have some tactical speed increases, but we are getting close to the limit of FMA ops the host CPU can execute per second. The currently required FMA count is about (iMul x iTaps x 2)^2 per input sample. It can be decreased a bit (about 10% in SIMD, maybe up to 30% in C code with the current optimization of the kernel_row_useful_range table) by skipping the zeroes outside the round kernel inside its square buffer, but that is not very significant.

When I look at the current full-row-walking processing, it looks like applying a FIR filter in 1D, with a kernel of size iTaps x iMul x 2. But it may be possible to compute an equal, or acceptably close, IIR filter kernel giving the same (or acceptably similar) output. An IIR filter kernel has fewer coefficients, so it may require fewer FMA ops, at least for 1D row processing. The number of FMA ops for a full 2D resize would then drop to about iMul x iTaps x 2 x C_IIR per input sample, where C_IIR is the fixed number of IIR filter coefficients for row processing and C_IIR < iTaps x iMul x 2. Maybe it is also possible to perform 2D/planar IIR filtering with a further reduction of FMA ops per input sample instead of the current 2D FIR processing, but that may perform poorly because of how the image is laid out in memory when scanning. I will start testing this idea by calculating IIR filter coefficients from the current kernel rows and checking how large the errors are with a small enough number of coefficients.
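A purely illustrative 1D comparison of the operation counts involved (not the plugin's code): a FIR row filter costs fir_len multiply-adds per output sample, while a recursive (IIR) filter approximating the same response costs only nb + na multiply-adds, which can be much smaller:

```cpp
// y[i] = sum_k h[k] * x[i+k]                        (FIR: fir_len multiply-adds per sample)
static void fir_row(const float* x, float* y, int n, const float* h, int fir_len)
{
    for (int i = 0; i + fir_len <= n; ++i)
    {
        float acc = 0.0f;
        for (int k = 0; k < fir_len; ++k)
            acc += h[k] * x[i + k];
        y[i] = acc;
    }
}

// y[i] = sum_k b[k] * x[i-k] + sum_m a[m] * y[i-m]   (IIR: nb + na multiply-adds per sample)
static void iir_row(const float* x, float* y, int n,
                    const float* b, int nb, const float* a, int na)
{
    for (int i = 0; i < n; ++i)
    {
        float acc = 0.0f;
        for (int k = 0; k < nb && k <= i; ++k)    // feed-forward part
            acc += b[k] * x[i - k];
        for (int m = 1; m <= na && m <= i; ++m)   // feedback part (previous outputs)
            acc += a[m - 1] * y[i - m];
        y[i] = acc;
    }
}
```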

DTL2020 commented 3 years ago

Uploaded a supplementary repository: https://github.com/DTL2020/FIR_to_IIR . Unfortunately, with the number of IIR coefficients at about 1/2 of the FIR length, the quality of the impulse-response samples not 'covered' by the IIR is still poor, and the error at some samples is significantly > 100%. And if I attempt to use more coefficients, the solver fails for an unknown reason, maybe because of the limited precision of the float calculations in the solver, though it uses double precision. Maybe another way of converting FIR to IIR is needed, or some refinement algorithm of acceptable speed for refining the coefficients of the IIR impulse response at the required edges. Anyway, I will try to make a version of the resampler with IIR processing, to look at actual image processing results even with the current error level relative to the required kernel samples, and to test its speed.

DTL2020 commented 3 years ago

Strategic announcement: committed to https://github.com/DTL2020/2D_resampler a version of the resampler with significantly decreased memory use. It now uses only an output_width x iKernelSize temp buffer (per thread, for future MT/OpenMP), so the whole processing fits into the CPU cache, at least L2/L3. This approach may have some difficulties for multi-threaded processing at the edge rows of each thread's stripe, but I think it can be worked around easily enough. I will try to add it to the AviSynth master-1 branch when I have access to its build environment a few days later. With the SIMD versions I will test non-cached loads and non-cached stores of the input and output data, to see whether preventing cache pollution by read-once and write-once data adds more speed. This version also outputs the finally processed rows directly from the main processing loop, so the additional pass over main memory for the conversion from float to uint8 output is also removed.
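A minimal sketch of such a circulating row buffer, with hypothetical names: only iKernelSize output rows are kept in flight, and a row is flushed (converted and written out) as soon as no remaining input sample can touch it:

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

// Ring of kernel_size float rows: output row 'y' lives at index y % kernel_size,
// so memory use is out_width * kernel_size floats instead of a full output frame.
struct CirculatingBuf
{
    int out_width, kernel_size;
    std::vector<float> rows;

    CirculatingBuf(int w, int ks)
        : out_width(w), kernel_size(ks), rows(static_cast<size_t>(w) * ks, 0.0f) {}

    float* row(int y)
    {
        return rows.data() + static_cast<size_t>(y % kernel_size) * out_width;
    }

    // Once row 'y' can no longer be touched, convert/write it out (e.g. float -> uint8)
    // and recycle its storage for row y + kernel_size.
    template <typename FlushFn>
    void flush_row(int y, FlushFn&& flush)
    {
        float* r = row(y);
        flush(y, r);
        std::fill(r, r + out_width, 0.0f);
    }
};
```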

DTL2020 commented 3 years ago

Well - finally a milestone in performance: the function KernelRowAll_avx2_mul4_taps4_cb, running as a single thread with frame-based multithreading in AviSynth+, reaches about 33% of the theoretical FMA throughput of CPUs like the i3-9100T and i5-9600K. It can upsize a FullHD 1920x1080 YV12 frame to 8K at about 45 fps on an i5-9600K running at 4.3 GHz with 6 threads. Most of the performance boost over the previous AVX FMA 8-sample engine seems to come from the smaller memory use of the 'circulating buf' of size only out_width x KernelSize instead of the full output frame size in floats. With the C-code upsampler the circulating buf makes very little difference, maybe because the CPU processing is too slow without a good AVX FMA engine. I am thinking of making the '_cb' functions multithreaded in the future; I am still weak in C++, so I need to find out how to make an array of vectors for each thread. Some work is also needed to align the edges of the frame stripes processed by each thread. Some more ideas for future optimizations may add performance too. I made an attempt at release binaries at https://github.com/DTL2020/AviSynth-JincResize/releases/tag/v0.3-alpha for testing.

DTL2020 commented 3 years ago

A note: it looks like I found a fast enough way of shifting float32 data through a sequence of SIMD registers by any number of floats, to implement multi-input-sample-per-pass convolution for multipliers of 2, 3, 5, etc. For AVX2 it is a sequence of 2 instructions: _mm256_permutevar8x32_ps to rotate the floats inside a ymm by the required number of positions, and _mm256_blend_ps to transfer the shifted-out floats into another ymm. And for AVX512F+AVX512VL there is a single instruction to permute any selection of float32s between 2 zmm registers: _mm512_permutex2var_ps . I hope many 'client' CPUs with the AVX512F+AVX512VL instruction set will be available soon. It looks like we need to add more 'opt' parameter values to separate opt=3 AVX512F from the AVX512F+AVX512VL CPU features required; maybe opt=4 or 5.
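A sketch of the AVX2 two-instruction shift described above, illustrated for a shift by 2 floats across a pair of ymm registers holding 16 consecutive floats (in real code the rotation index and blend mask would be generated per iMul):

```cpp
#include <immintrin.h>

// Shift a 16-float window held in (lo, hi) left by 2 floats:
// result lanes 0..5 come from lo[2..7], lanes 6..7 come from hi[0..1].
static __m256 shift_left_2_floats(__m256 lo, __m256 hi)
{
    const __m256i rot = _mm256_setr_epi32(2, 3, 4, 5, 6, 7, 0, 1);  // rotate by 2 positions
    const __m256 lo_r = _mm256_permutevar8x32_ps(lo, rot);
    const __m256 hi_r = _mm256_permutevar8x32_ps(hi, rot);
    return _mm256_blend_ps(lo_r, hi_r, 0xC0);   // take lanes 6..7 from the rotated 'hi'
}
```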

Asd-g commented 3 years ago

I did a quick speed test with a 1920x1080 YV12 clip, CPU 7900x, VM with 16 threads + 16 GB RAM:

The test was done with -timelimit=10 (every test run lasts 10s). Every ap=x was tested twice and the run with the higher fps was picked. opt=1 was used because opt=2/3 + ap=1 gives wrong output.

Here is the same frame with different opt/ap options (top-left subtitle): https://0x0.st/-H7U.png https://0x0.st/-H7G.png https://0x0.st/-H7D.png https://0x0.st/-H7k.png https://0x0.st/-H7d.png https://0x0.st/-H7n.png

Edit: the test was done with v0.3-alpha release.

DTL2020 commented 3 years ago

Heh - for some unknown reason gmail.com sorted the notification e-mail into the spam box. The currently fastest functions work only with AVX2 for mul=4 and mul=2 (in the latest commits to the master-1 branch). Also, I usually use multipliers in the arguments to reduce the chance of errors in the width and height calculations. So better test strings for the currently fastest versions are

JincResize(width*2, height*2, opt=2, ap=2) and 
JincResize(width*4, height*4, opt=2, ap=2)

I have not tested with the exact numbers 7680 and 4320 as input. Yes, using ap=1 gives distortions with some combinations of the other args; this bug needs fixing, or, if ap=2 converted to internal multithreading always gives better speed results, the ap=1 processing will be removed from future versions. As a developer, it is better to use fresh builds of the master-1 branch and check in a debugger that the code really uses the currently fastest functions: KernelRowAll_avx2_mul4_taps4_cb() for iMul=4 and iTaps=4, and KernelRowAll_avx2_mul2_taps4_cb() for iMul=2 and iTaps=4. The resizer start-up in JincResize::JincResize() has a large selector for the many resizers designed, so it may fail to select the proper function if some bug still exists in the detection/selection logic.

Also, it looks like auto-multithreading by OpenMP sometimes fails to process the row stripes correctly and gives errors (green lines) at the start or end of a stripe. So it is better to do hand-written buffer management for the multithreading, while of course still using OpenMP for smaller code size: simply use #pragma omp parallel (num_threads etc.) {} on a block of parallel code instead of a parallel for {} with automatic management of the loop variable, and instead of parallelising by loop variable use omp_get_thread_num() to get the thread id and do hand-written striping of the processed buffer. It is required anyway for the smallest-memory _cb functions, because the OpenMP logic cannot auto-align the edges of the row-striped buffer when a 'circulating buffer' is used. I will try to do it a bit later. I made a test with a many-core Xeon (18 cores) with Hyper-Threading: using 'HT' cores looks like a bad idea, so the best Prefetch() param is only up to the number of real cores. Also its performance looks memory-limited, so FullHD-to-8K upsampling is still about 45 fps even with 18 threads on an approximately 2 GHz Xeon; a consumer-grade i7 at 4+ GHz does this with 6 threads.
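A minimal sketch of that manual striping, with hypothetical names (not the plugin's actual code): a plain parallel region plus omp_get_thread_num() instead of a parallel for:

```cpp
#include <omp.h>

// Each thread processes its own horizontal stripe of rows; the stripe boundaries are
// computed by hand so they can later be aligned with the circulating-buffer logic.
static void process_all_rows(int height, int num_threads)
{
    #pragma omp parallel num_threads(num_threads)
    {
        const int tid = omp_get_thread_num();
        const int nthr = omp_get_num_threads();
        const int rows_per_thread = (height + nthr - 1) / nthr;
        const int row_start = tid * rows_per_thread;
        const int row_end = (row_start + rows_per_thread < height)
                                ? row_start + rows_per_thread : height;

        for (int y = row_start; y < row_end; ++y)
        {
            // process_row(y, tid);  // hypothetical per-row work for this thread
        }
    }
}
```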

https://0x0.st/-H7U.png - yes, those small green lines at the left look like the bug caused by OpenMP auto-multithreading. So I am thinking of putting effort into manual memory management for the multithreading of the currently faster 'ap=2 / circulating buffer' path.

https://0x0.st/-H7k.png - that build apparently had the AVX512 processing disabled (I do not have AVX512 on my build machines, and my old Visual Studio versions like 2015 cannot compile AVX512 intrinsics), with just returns in the calls, so opt=3 does not perform any actual processing. The latest commits to master-1 have AVX512 fixed, but it needs to be built with the AVX512 define. I use the Intel C++ command-line compiler to build with AVX512, so add /DAVX512 to the command line.

https://0x0.st/-H7d.png - the Avisynth command at the left looks correct for using the faster multi-sample processing. Maybe check in a debugger that KernelRowAll_avx2_mul4_taps4_cb() is used for the actual processing.

"The test was done with -timelimit=10 (every test run lasts 10s)." - maybe extend the limit to about 1 minute; AviSynth+ MT is slow at startup as far as I can see.

"CPU 7900x" - if it is the i9-7900X https://ark.intel.com/content/www/us/en/ark/products/123613/intel-core-i9-7900x-x-series-processor-13-75m-cache-up-to-4-30-ghz.html , it is a great CPU even now, with some server-class features like 2x512-bit FMA units per core, so with 10 real cores it should run JincResize(width*4, height*4, taps=4, opt=2, ap=2) with Prefetch(8..10) for 8K YV12 output at about 100 fps. Maybe the VM (virtual machine?) causes some interference in performance? The memory speed should be enough: storing 8K YV12 at 100 fps requires only about 5 GB/s, and this CPU is rated at 80+ GB/s. And the 'circulating buffers' will possibly fit in the cache: a 7680 output width x 4 bytes (float) x KernelSize 32 = 983 KB 'circ buf' per internal buffer per thread, so the 13 MB L3 cache will fit 10 threads' buffers (though they may conflict with the load and store streams in some way).

DTL2020 commented 3 years ago

Finally added internal multithreading to the _cb family of functions. It actually uses a bit more FMA than the single-threaded version for correct aligning of the row stripes, so currently the old single-threaded path is used if threads=1 and the new multithreaded versions if threads > 1. An AVX512 build is also available, made with the Intel C compiler. I added a new build release at https://github.com/DTL2020/AviSynth-JincResize/releases/tag/v0.4-alpha . I made some tests on different CPUs with an ImageReader source and ConvertToYV12() before upsize, and also last tests with a faster source, a BlankClip of 1920x1080. It looks like ImageReader and ConvertToYV12() running on single-threaded AviSynth+ are too slow even for feeding a 1920x1080 source. The results file is attached. It currently uses a vector of vectors for storing and rotating the memory pointers of the row temp buffers, but it looks like the vectors are copied at function calls. If someone can help rewrite it to use pointers to vectors instead of copying vectors, I think it will be a bit faster. I am still too weak in C++ to convert the JincResize::ConvertiMulRowsToInt_avx2(std::vector<float*> Vector, int iInpWidth, int iOutStartRow, unsigned char* dst, int iDstStride) call to vector pointers. Also, I am not sure whether the internal vectors defined (created with new) at https://github.com/DTL2020/AviSynth-JincResize/blob/19c636dfefe7971722c99beefd54ddb21fd0c67e/src/JincResize.cpp#L691 need to be deleted in the class destructor at https://github.com/DTL2020/AviSynth-JincResize/blob/19c636dfefe7971722c99beefd54ddb21fd0c67e/src/JincResize.cpp#L1499 (if uncommented, it creates an error on exit somewhere in the vector class). Currently it may cause some memory leak if they are not deleted correctly.

res.txt

Asd-g commented 3 years ago

It currently uses a vector of vectors for storing and rotating the memory pointers of the row temp buffers, but it looks like the vectors are copied at function calls. If someone can help rewrite it to use pointers to vectors instead of copying vectors, I think it will be a bit faster. I am still too weak in C++ to convert the JincResize::ConvertiMulRowsToInt_avx2(std::vector<float*> Vector, int iInpWidth, int iOutStartRow, unsigned char* dst, int iDstStride) call to vector pointers.

You're probably looking for JincResize::ConvertiMulRowsToInt_avx2(std::vector<float*>const& Vector, int iInpWidth, int iOutStartRow, unsigned char* dst, int iDstStride).

Also, I am not sure whether the internal vectors defined (created with new) at https://github.com/DTL2020/AviSynth-JincResize/blob/19c636dfefe7971722c99beefd54ddb21fd0c67e/src/JincResize.cpp#L691 need to be deleted in the class destructor at https://github.com/DTL2020/AviSynth-JincResize/blob/19c636dfefe7971722c99beefd54ddb21fd0c67e/src/JincResize.cpp#L1499 (if uncommented, it creates an error on exit somewhere in the vector class). Currently it may cause some memory leak if they are not deleted correctly.

That vector ( https://github.com/DTL2020/AviSynth-JincResize/blob/19c636dfefe7971722c99beefd54ddb21fd0c67e/src/JincResize.cpp#L691 ) should be deleted after https://github.com/DTL2020/AviSynth-JincResize/blob/19c636dfefe7971722c99beefd54ddb21fd0c67e/src/JincResize.cpp#L698 in the same scope. Does this variant have a performance penalty:

std::vector<float*> ThreadRowPointers;
ThreadRowPointers.reserve(iKernelSize);

for (int i = 0; i < iKernelSize; i++)
{
    ThreadRowPointers.push_back((float*)_mm_malloc(iWidthEl * (iMul + 1) * sizeof(float), 32)); // +1 because there looks to be a buffer overrun somewhere
}

vpvThreadsVectors.push_back(ThreadRowPointers);

DTL2020 commented 3 years ago

Thank you. It seems to work OK, with no visible performance penalty. I found a weird feature: while the Intel C compiler suggests putting the most used 'global' class members iMul/iTaps/iKernelSize at the beginning because it improves data locality, I found that making a local copy of these constants for processing, or even replacing them with hardcoded constants where possible, significantly improves performance (about +20%). So I started the long process of rewriting most or all of the KernelRow* functions to take these control values as function arguments (where needed), so the next updates will take some time. I also found a way to improve the balance of the 'scatter/gather' processing: as it is done with 'horizontal multi-sample' processing, it can partially be done with 'multi-row vertical' data gathering without additional memory read-write transactions, though it significantly increases code size and complexity. I have currently made some _r2 functions (processing 2 rows per loop pass and per register load-store operation). They add about +20% performance (to the pure C and to the AVX2 versions), so maybe I will try _r4 (4 rows) later to see if it adds more. It already reaches about 50% of the FMA throughput of a 4-core CPU with 2x256-bit FMA units, so it is unlikely to gain much more. The still-unfinished debugging of the stripe aligning in the multithreaded versions of the _r2 functions is uploaded to my fork repository; single-threaded works.