AviSynth / AviSynthPlus

AviSynth with improvements
http://avs-plus.net

CPUDevice->Allocate for large-tr temporal filtering #351

Open DTL2020 opened 1 year ago

DTL2020 commented 1 year ago

Currently CPUDevice->Allocate() uses a plain `new` for memory allocation.

It is https://github.com/AviSynth/AviSynthPlus/blob/45404659f0fed07e369978924efd417221bdc782/avs_core/core/DeviceManager.cpp#L180

(If I found this place correctly - Asd-g told me that current AVS provides up-to-AVX512, 64-byte aligned start addresses for each row?)

Is it not guaranteed to be SIMD-aligned for SSE/AVX/AVX512? But it is also not guaranteed to be safe for long-tr temporal processing filters that read many sequential frames in parallel. If the read addresses overload the number of ways in a cache set, performance drops significantly. The typical number of ways in an L1 cache set is about 8, so a large-tr temporal filter with tr>4 may already cause a possible performance penalty. An example of bad allocation start addresses is 4-KByte page-aligned allocations.

2 possible solutions exist:

  1. Fast but with poorer, uncontrolled memory allocation - allocate some number of cachelines of extra RAM and add a random cacheline-sized offset to each start address. It does not require tracking all used start-address offsets, but it can hit already-used offsets because the random number generator can return equal numbers. Also some address space is used poorly (maybe more RAM is wasted). This is the solution used in the large-page mod - https://github.com/DTL2020/AviSynthPlus-LP-mod/blob/62d7c268f279d4f6917b323e6806082847603b67/DeviceManager.cpp#L198

  2. Make some global class in the AVS core that watches the sequential frames' start-address allocations: simply advance each next frame's start address by the cacheline size, and reset the added offset to zero after all required addresses are used (the total max offset can be calculated from the current CPU cache params: number of ways, cache size and cacheline size). Then all sequential frames in the AVS environment are guaranteed to have offsets that hit different cache sets, and the performance penalty for large-tr temporal filters will probably be removed completely.
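Solution 2 could be sketched roughly like this (a hypothetical dispenser class and allocator wrapper for illustration only, not actual AVS+ code; the 64-byte line size and 32 KB L1D footprint are assumptions):

```cpp
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical sketch: a global offset dispenser that advances each
// successive frame allocation by one cacheline, wrapping after the whole
// assumed L1D footprint (32 KB) has been covered, so sequential frames
// land in different cache sets.
class FrameOffsetDispenser {
  static constexpr size_t kCacheLine = 64;        // assumed L1D line size
  static constexpr size_t kMaxOffset = 32 * 1024; // assumed L1D size
  size_t next_offset_ = 0;
public:
  // Returns the extra byte offset for the next frame's base address.
  size_t next() {
    const size_t off = next_offset_;
    next_offset_ += kCacheLine;
    if (next_offset_ >= kMaxOffset)
      next_offset_ = 0; // reset after all distinct sets have been used
    return off;
  }
};

// Allocate 'bytes' plus padding; return a 64-byte-aligned pointer shifted
// by the dispenser offset. The raw pointer is stashed just before the
// returned address so it can be recovered for freeing.
inline void* alloc_frame(FrameOffsetDispenser& d, size_t bytes) {
  const size_t off = d.next();
  const size_t pad = 32 * 1024 + 64 + sizeof(void*);
  char* raw = static_cast<char*>(std::malloc(bytes + pad));
  if (!raw) throw std::bad_alloc();
  uintptr_t p = reinterpret_cast<uintptr_t>(raw) + sizeof(void*) + 63;
  p &= ~uintptr_t(63);                    // align up to 64 bytes
  p += off;                               // apply per-frame cache-set offset
  reinterpret_cast<void**>(p)[-1] = raw;  // remember raw pointer
  return reinterpret_cast<void*>(p);
}

inline void free_frame(void* p) {
  std::free(reinterpret_cast<void**>(p)[-1]);
}
```

The padding cost is bounded by the cache footprint (32 KB here), which matches the observation below that the extra RAM per frame buffer stays small relative to typical frame sizes.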

This hardware limit is also described in the Intel Software Optimization Manual:

> **3.6.7 Capacity Limits and Aliasing in Caches**
>
> There are cases in which addresses with a given stride will compete for some resource in the memory hierarchy. Typically, caches are implemented to have multiple ways of set associativity, with each way consisting of multiple sets of cache lines (or sectors in some cases). Multiple memory references that compete for the same set of each way in a cache can cause a capacity issue. There are aliasing conditions that apply to specific microarchitectures. Note that first-level cache lines are 64 bytes. Thus, the least significant 6 bits are not considered in alias comparisons.
>
> **3.6.7.1 Aliasing Cases in the Pentium® M, Intel® Core™ Solo, Intel® Core™ Duo and Intel® Core™ 2 Duo Processors**
>
> Pentium M, Intel Core Solo, Intel Core Duo and Intel Core 2 Duo processors have the following aliasing case:
>
> • Store forwarding — If a store to an address is followed by a load from the same address, the load will not proceed until the store data is available. If a store is followed by a load and their addresses differ by a multiple of 4 KBytes, the load stalls until the store operation completes.
>
> **Assembly/Compiler Coding Rule 49.** (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set. When declaring multiple arrays that are referenced with the same index and are each a multiple of 64 KBytes (as can happen with STRUCT_OF_ARRAY data layouts), pad them to avoid declaring them contiguously. Padding can be accomplished by either intervening declarations of other variables or by artificially increasing the dimension.
>
> **User/Source Coding Rule 8.** (H impact, ML generality) Consider using a special memory allocation library with address offset capability to avoid aliasing. One way to implement a memory allocator to avoid aliasing is to allocate more than enough space and pad. For example, allocate structures that are 68 KB instead of 64 KBytes to avoid the 64-KByte aliasing, or have the allocator pad and return random offsets that are a multiple of 128 Bytes (the size of a cache line).
>
> **User/Source Coding Rule 9.** (M impact, M generality) When padding variable declarations to avoid aliasing, the greatest benefit comes from avoiding aliasing on second-level cache lines, suggesting an offset of 128 bytes or more.
>
> 4-KByte memory aliasing occurs when the code accesses two different memory locations with a 4-KByte offset between them. The 4-KByte aliasing situation can manifest in a memory copy routine where the addresses of the source buffer and destination buffer maintain a constant offset and the constant offset happens to be a multiple of the byte increment from one iteration to the next.

pinterf commented 1 year ago

The place you found is not where the allocation rules are ensured. It just allocates the given number of bytes (and a bit more, partly for debug ('deadbeef' - but yes, the +16 bytes are there in non-debug builds too for some reason) and CUDA ('margin') reasons).

Avisynth plus aligns pitch and frame pointers at 64 byte boundaries. I changed this at the very beginning of my involvement in the project.

This is done by setting a frame_align to 64 from FRAME_ALIGN, which is defined as 64 in avisynth.h.

When Nekopanda (in Neo branch) fixed a lot of MT issues and introduced the possibility of a CUDA-only pipeline, then it was changed not to be constant, at least not for a non-CPU device. The 'frame_align' and 'plane_align' variables were introduced then. Allocation rules for CUDA are different, 64 bytes alignment may be too much.

https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/core/avisynth.cpp#L2346

The actual alignment in a video frame, setting the beginning offset and the pitches are all done in NewVideoFrame https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/core/avisynth.cpp#L3576 or https://github.com/AviSynth/AviSynthPlus/blob/master/avs_core/core/avisynth.cpp#L3636

DTL2020 commented 1 year ago

Yes - I see the 64-byte alignment is performed. But I still do not see whether the required offsets in virtual address space are added when placing different sequential frames of a clip in memory, to protect from possible cache aliasing conditions when some processing filter does large-tr frame access (i.e. reads the same x,y coordinate sample, or a SIMD-dataword group of samples of cacheline size - 32 or 64 bytes for AVX/AVX512 - from a sequence of frames +-tr around the current frame).

In my experience with LargePage allocations and no offsets added for different frame allocations, the performance of mvtools/MDegrainN with tr around 10 decreases by about 50%. MDegrainN suffers the most, because MAnalyse only processes pairs of frames while MDegrainN accesses all tr-pooled frames in parallel. So the raw performance impact on MDegrainN alone is >50%. Now we also have vsTTempSmooth, a sample-based temporal processing filter, also with tr of several dozens in the good case (or equal to the mvtools MAX_TRAD of 128).

With random start addresses from the 'new' method, performance may be somewhere in between the worst case (all frame/row start addresses aligned to hit the same cache set) and the best case (the cache sets filled sequentially in the best order).

Some calculations for a typical 32 KB, 8-way L1D cache with a 64-byte cacheline size:

A cache architecture is typically designed to be completely filled by sequential memory access (reads, for example). I.e. sequential memory reading in virtual address space (of a cache-sized block?) does not cause aliasing conditions. So a memcpy of the form read_all_cache then write_all_cache is about the best performance, saving on bus direction switching.

It means a sequential read of 32 KBytes of virtual addresses fills all sets of the whole cache. With a 64-byte cacheline and 8 ways, the size of each set is 64x8 = 512 bytes. The number of sets is 32768/512 = 64. Each set can serve only 8 aliasing memory addresses (each of cacheline size). Also it may not be good to fill a whole set, because data from some other process at unknown addresses also needs to be cached. So it may be good to fill only Nways/2 of each set.
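The set arithmetic above can be checked with a tiny model (assuming the usual indexing scheme, set = (address / linesize) mod nsets, for the 32 KB / 8-way / 64-byte example):

```cpp
#include <cstdint>

// Simple model of a 32 KB, 8-way, 64-byte-line L1D cache:
// 64 sets, so the set index repeats every 64 * 64 = 4096 bytes.
constexpr unsigned kLine = 64, kWays = 8, kSize = 32 * 1024;
constexpr unsigned kSets = kSize / (kLine * kWays); // = 64

// Which set a given (virtual) address maps to in this model.
constexpr unsigned set_index(uintptr_t addr) {
  return (addr / kLine) % kSets;
}
```

Under this model, any two addresses exactly 4 KB apart map to the same set, which is why page-aligned frame starts are the pathological case; shifting one address by a single cacheline moves it to the next set.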

So the equal x,y coordinates of sequentially allocated frames in the AVS core would be expected to get virtual addresses that fill (hit) only about Nways/2 of each cache set when some large-tr filter processes frames in 'parallel access' mode.

Also it looks like the max total addition to a frame buffer allocation, to play with offsets inside it, is only 32 KBytes (if optimizing for the L1D cache only?). So it does not waste too much RAM for typical frame sizes.

The MB-sized L2/L3 caches typically have 128-byte cachelines (as the Intel docs say?) and about 12 ways on poor-people chips, and maybe more ways on expensive server-class chips (may have 16..20..24 ways). So it is good either to optimize for both L1 and L2/L3 cache usage (if possible), or maybe even to make a user-mode switch to optimize the addressing offsets either for L1 or for L2/L3, so the user can try both ways and select the fastest for the current use case of the whole script and all filters. The formula for the best addressing offsets may not be very simple, so the user may manually define the CPU cache params (size, cacheline size, N ways) for the algorithm to use when preparing memory addressing in the AVS core.

Getting the CPU cache params in software may be very platform-dependent: https://stackoverflow.com/questions/794632/programmatically-get-the-cache-line-size On Linux (with a reasonably recent kernel), you can get this information out of /sys: /sys/devices/system/cpu/cpu0/cache/ This directory has a subdirectory for each level of cache. Each of those directories contains the following files: coherency_line_size, level, number_of_sets, size, ways_of_associativity.
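A minimal sketch of reading one of those sysfs files (falling back to 64 bytes, the common x86 value, when the file is absent or unreadable - the fallback constant is an assumption):

```cpp
#include <fstream>

// Read the L1 cacheline size from Linux sysfs.
// Returns 64 (assumed common x86 value) if the file cannot be read.
inline unsigned cache_line_size_linux() {
  std::ifstream f(
      "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size");
  unsigned sz = 0;
  if (f >> sz && sz > 0)
    return sz;
  return 64; // fallback when sysfs is unavailable
}
```

The other files (number_of_sets, ways_of_associativity, size) can be read the same way to feed an offset-calculation formula.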

For Windows (Intel-architecture chips?) something based on a cpuid request and parsing may be used, the same as for SIMD feature info.

pinterf commented 1 year ago

I suppose this one https://github.com/DTL2020/AviSynthPlus-LP-mod/blob/62d7c268f279d4f6917b323e6806082847603b67/DeviceManager.cpp#L275 works only if granularity is exactly 2M? https://github.com/DTL2020/AviSynthPlus-LP-mod/blob/62d7c268f279d4f6917b323e6806082847603b67/DeviceManager.cpp#L249

Does it really need to use VirtualAlloc? Can it work when just using the memory allocation of the actual C++ library and provide the random offset back?

I think I am slowly understanding the 4k aliasing problem. This is also a good link, from which I noticed that processing can really suffer a lot from this phenomenon. https://stackoverflow.com/questions/25774190/l1-memory-bandwidth-50-drop-in-efficiency-using-addresses-which-differ-by-4096

Aside from the fact that specific scripts such as MDegrainN over +/-5 or more frames would benefit, doesn't it hurt other use cases? Large pages support alone can give surprises; I read some considerations here as well: https://reviews.llvm.org/D58718 (And we are talking only about tuning an Avs+ running on Windows 10-11 with 16GiB+ memory?)

DTL2020 commented 1 year ago

"https://github.com/DTL2020/AviSynthPlus-LP-mod/blob/62d7c268f279d4f6917b323e6806082847603b67/DeviceManager.cpp#L275 works only if granularity is exactly 2M?"

That method of ptr = (BYTE*)((uint64_t)ptr >> 21); is simply to avoid storing a separate pointer for freeing the memory. Otherwise the programmer needs to store several pointers for each memory allocation: the pointer for freeing the memory region and the working pointer for the application. I am not sure whether VirtualFree with some 'random internal address of the allocated memory block' would free the memory region correctly.

"Does it really need to use VirtualAlloc?"

From the programmer's view, use of the OS-dependent VirtualAlloc is not strictly required. It is a handy method on Windows to get a page-aligned pointer and also a zeroed memory region (it is physically zeroed by Windows service threads in idle system time, so a memory region returned by VirtualAlloc is already guaranteed to be zeroed clean). But with C-standard (OS-independent) memory allocation the starting virtual address of the allocation is not known, so the extra size padding needed to play with the real working address range inside the allocation may be twice as large (?).

"Can it work when just using the memory allocation of the actual C++ library and provide the random offset back?"

Yes - it can. The only issue may be a bit more RAM wasted on larger padding, to have enough space inside the allocated region for placing the actual working buffer (its start and end virtual addresses).

"the 4k aliasing problem,"

The aliasing problem is not only 4K - 4K is only one of all the possible aliasing cases. The best solution is to keep track of the actual virtual-memory-to-cache-set mapping, with a sort of 'watch algorithm' keeping the usage (filling) of cache sets at a level of around 50% of the total ways in each set. It is a typical Intel optimization recommendation - optimize for about 50% of cache size. The whole 32/64-bit virtual address space is interleave/stripe-mapped onto the sets of a set-associative cache, so an aliasing condition may happen with very 'far' addresses: really, all 32/64-bit virtual addresses are mapped onto a typical 32-KByte L1D cache by some algorithm. The same applies to all other cache levels. For 'pathological' virtual memory access patterns the real usable cache size is reduced to the size of one set only, i.e. cacheline_size x N_ways (IMHO). For an L1D cache with a 64-byte linesize and 8 ways that is only 512 bytes, and the performance of the whole memory subsystem drops significantly. For AVX512 frame processing it means the cache can handle only 8 frames in the tr-pool without a performance drop.

" doesn't it hurt other use cases?"

The benefit is also expected with vsTTempSmooth and any other 'large-tr' temporal filter accessing 'too many' frames around the current one in a 'parallel way' (reading the same x,y coordinates).

"Large pages support alone can give surprises,"

LP is unlikely to be usable in current AVS without a much bigger core redesign - a typical Windows install reserves only around 256 MB for LP allocations after a fresh system reboot. Also, Windows does not keep this physical RAM contiguous; it quickly becomes fragmented into 4K pages and unavailable for LP allocations, so Windows simply returns an 'out of resources' error. A special kernel-mode physical-RAM defragmenter service driver would be required to gather physically contiguous RAM again after Windows has been running, and Microsoft does not provide such a driver - it is the user's task to create and use one. So it would be required to design an OS-startup driver that allocates the required GBs of contiguous physical RAM address space (using a file or a system registry record to define how much to allocate) and provides it to the AVS core (via a DeviceIoControl call and memory mapping into the user ring-3 protection-mode address space) after system bootup completes. It would most probably be a single GB-sized RAM allocation for internal AVS memory management. Also this RAM would be lost to all other Windows processes and devices until reboot or deactivation of the special RAM-grabbing driver (or maybe a driver command to free the memory until the next reboot). It is a special AVS use case. So that LP-mod for AVS exists only as a tech demo, to make some performance tests after a fresh Windows reboot (better in safe mode) and with plugins using only about 200 MB of RAM.

"we are talking only about tuning an Avs+ running on Windows 10-11 with 16GiB+ memory?"

Set-associative cache aliasing issues exist at any RAM size Windows can boot with - like from 1 MB. Also, it is a CPU hardware issue, not OS-dependent. If we had fully associative caches these issues would be removed, but fully associative caches are too expensive and resource-costly (power and silicon size), so in most general-purpose CPUs we will most probably have only limited set-associative caches till the end of current civilization.

DTL2020 commented 1 year ago

" good link, from which I noticed that the processing can really suffer a lot from this phenomenon. https://stackoverflow.com/questions/25774190/l1-memory-bandwidth-50-drop-in-efficiency-using-addresses-which-differ-by-4096"

That issue may be a bit different when only one read location and one write location are accessed. It may be the read-write with bad stride case described in the Intel manual: Store forwarding — If a store to an address is followed by a load from the same address, the load will not proceed until the store data is available. If a store is followed by a load and their addresses differ by a multiple of 4 KBytes, the load stalls until the store operation completes.

This use case is about reading many locations in parallel for data analysis. If the read virtual addresses form some 'pathological pattern', the performance of the memory subsystem drops because the cache stops serving memory read requests fast enough (also, the hardware prefetchers cannot fill the cache at best speed for pre-reading if all read requests map to the same cache set and quickly exhaust the number of available 'ways').

DTL2020 commented 1 year ago

A sort of short description of the issue: on set-associative cache computing platforms, many different ranges of the total virtual address space are mapped to a single cache set. Even if all resources of a single cache set are busy (full), the mapping cannot be changed at runtime to other, free cache sets (that would degrade chip performance for the cache search algorithm). 'Old' but still useful cache data in the set is simply replaced with 'new' data - maybe with a least-recently-used replacement algorithm. That is a sort of short description of the 'cache aliasing issue'.

The programming task for a memory manager designer in a many-read/(write)-stream processing environment: adjust the virtual addresses of the different read/write streams so that they map to different cache sets. Then each of the many parallel read/write streams can be served by an individual cache set, and the total performance of the memory subsystem is maximal.
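The stream-placement task can be illustrated with a toy model (hypothetical helper, not AVS+ code; the 4 KB stream stride simulates page-aligned frame bases, and one-cacheline-per-stream staggering is solution 2 from the first post):

```cpp
#include <cstdint>
#include <set>

// Toy cache model: 32 KB, 8-way, 64-byte lines -> 64 sets.
constexpr unsigned kToyLine = 64, kToySets = 64;

constexpr unsigned toy_set_index(uintptr_t a) {
  return (a / kToyLine) % kToySets;
}

// Count how many distinct sets the first 'n' stream base addresses hit,
// when stream k starts at base + k*stride + k*stagger bytes.
inline std::size_t distinct_sets(unsigned n, uintptr_t stride,
                                 uintptr_t stagger) {
  std::set<unsigned> sets;
  for (unsigned k = 0; k < n; ++k)
    sets.insert(toy_set_index(0x100000 + k * stride + k * stagger));
  return sets.size();
}
```

With page-aligned bases (stride 4096, stagger 0) all 16 streams collide in one set; adding one cacheline of stagger per stream spreads them over 16 different sets, which is exactly the effect the proposed global offset class would produce.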