jitter matrix processing not inlined, inefficient, millions of jumps/calls per frame

The most important part of a jitter external is the matrix calculation loop. Full stop. The min-api is not designed to be efficient in this most important area. Critical matrix calculation functions are not inlined and instead have multiple millions (for an HD frame) of expensive calls to/from for every frame's matrix calculation.

Writing C/C++ code for jitter objects is a technical endeavor. Otherwise, a person should use (gen). I request the Cycling74 team do a thorough design review of min-api to surface these performance flaws.

Context

The matrix calc function in a jitter external is called once per frame. That function loops over all dimensions of input and output matrices to transform (or generate) the output matrices. It is common for such looping to be sliced into chunks (aka "ndim") and to spawn threads for each ndim chunk. And then each thread loops over its chunk.

Naturally, there is great need for the loops to be optimized, access to memory kept thread local in all cases possible, and for code to be local to the thread and not jump to/from other places in memory.

Modern compilers statically analyze code and functions to optimize the output executable. One optimization is to reduce code duplication by putting functions as independent units, and then each calling site calls/jumps to that shared function. Another optimization is to "inline" a function at the calling site. Inlining improves speed at the cost of larger code/memory size. Why consider to inline or not a function? Because jumps/calls are very expensive. This is just two of many optimizations compilers do and the approaches compilers take are based on the compiler's static analysis. And this analysis is specific to the function and the calling site. Meaning, it is possible for the compiler to inline a function in one calling location and not inline that same function in another calling location.

C/C++ coders can give a "hint" to the compiler to suggest the compiler inline or not inline a function. This is only a hint. Nothing guaranteed and it can't be forced. This is by design. https://docs.microsoft.com/en-us/cpp/cpp/inline-functions-cpp?redirectedfrom=MSDN&view=vs-2019 and https://en.cppreference.com/w/cpp/language/inline

There is a rumor that compilers tend to not inline functions longer than a few lines. That their analyzer's cost/benefit analysis determines that the size of all the functionality overwhelms the benefit of an unknown speed improvement. Since the analysis is static, the compiler can not verify this...its just an algorithmic guess. I can find nothing in the cpp specs to support this, however, there is a general "feeling" in the c++ community that for functions to be successfully inlined, they need to be very short and focused.

Setup

VS Community 2019 v16.6.0
max api commit 1c06b88fc30da0931b6dfa4a874bc081afe00926
min api commit 210d5da4b9a3ebc201b4be5c1f0733e3110b1993
min.jit.stencil.cpp from the min-devkit

For the reader that doesn't have experience with MAP files, read https://www.codeproject.com/Articles/3472/Finding-Crash-Information-Using-the-MAP-File
https://docs.microsoft.com/en-us/cpp/build/reference/decorated-names?view=vs-2019
https://flylib.com/books/en/4.441.1.87/1/
https://stackoverflow.com/questions/1902976/msvc-any-way-to-check-if-function-is-actually-inlined

Repo

Edit your harness's CMakeLists.txt to add the /MAP parameter to target_link_options().
Build stencil with cmake build variant Release
Examine the generated min.jit.stencil.map

Result

Core functions of the matrix calculation are not inlined. Instead, they are independent functions with multiple call sites jumping/calling into them. This is terribly inefficient and detrimental to a performant jitter external. Here is proof of two functions jit_calculate_vector, calc_cell

 0001:00013d90       ??$jit_calculate_vector@Vjit_stencil@@E$0A@@min@c74@@YAXPEAU?$minwrap@Vjit_stencil@@X@01@AEBVmatrix_info@01@JJPEAUt_jit_op_info@max@1@2@Z 0000000180014d90 f i min.jit.stencil.cpp.obj
 0001:00014bf0       ??$jit_calculate_vector@Vjit_stencil@@H$0A@@min@c74@@YAXPEAU?$minwrap@Vjit_stencil@@X@01@AEBVmatrix_info@01@JJPEAUt_jit_op_info@max@1@2@Z 0000000180015bf0 f i min.jit.stencil.cpp.obj
 0001:000160c0       ??$jit_calculate_vector@Vjit_stencil@@M$0A@@min@c74@@YAXPEAU?$minwrap@Vjit_stencil@@X@01@AEBVmatrix_info@01@JJPEAUt_jit_op_info@max@1@2@Z 00000001800170c0 f i min.jit.stencil.cpp.obj
 0001:000177f0       ??$jit_calculate_vector@Vjit_stencil@@N$0A@@min@c74@@YAXPEAU?$minwrap@Vjit_stencil@@X@01@AEBVmatrix_info@01@JJPEAUt_jit_op_info@max@1@2@Z 00000001800187f0 f i min.jit.stencil.cpp.obj
 0001:00018da0       ??$calc_cell@E$00@jit_stencil@@QEAA?AV?$array@E$00@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 0000000180019da0 f i min.jit.stencil.cpp.obj
 0001:00018ec0       ??$calc_cell@E$03@jit_stencil@@QEAA?AV?$array@E$03@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 0000000180019ec0 f i min.jit.stencil.cpp.obj
 0001:00018fe0       ??$calc_cell@H$00@jit_stencil@@QEAA?AV?$array@H$00@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 0000000180019fe0 f i min.jit.stencil.cpp.obj
 0001:000190e0       ??$calc_cell@M$00@jit_stencil@@QEAA?AV?$array@M$00@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 000000018001a0e0 f i min.jit.stencil.cpp.obj
 0001:00019200       ??$calc_cell@N$00@jit_stencil@@QEAA?AV?$array@N$00@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 000000018001a200 f i min.jit.stencil.cpp.obj

Here are two of those function names undecorated by undname.exe

void __cdecl c74::min::jit_calculate_vector<class jit_stencil,unsigned char,0>(struct c74::min::minwrap<class jit_stencil,void> * __ptr64,class c74::min::matrix_info const & __ptr64,long,long,struct c74::max::t_jit_op_info * __ptr64,struct c74::max::t_jit_op_info * __ptr64)

public: class std::array<unsigned char,1> __cdecl jit_stencil::calc_cell<unsigned char,1>(class std::array<unsigned char,1>,class c74::min::matrix_info const & __ptr64,class c74::min::matrix_coord & __ptr64) __ptr64

Expected

The entire ndim section of code for a matrix calculation to be contiguous, no jumps/calls to functions, and thread local storage whenever technically possible.

Workarounds

As min-api is currently designed, there is no quick fix. The segmentation of jitter functionality has lead to a cascade of functions calling to/from and results in significant overhead.

It may be possible (no guarantee) to request the compiler inline functions. For example:

min-api headers could declare inline on jitter member function prototypes like calc_cell so that when consumers of min-api write their own calc_cell they are forced to also inline. But remember...the compiler doesn't guarantee to actually inline it.
min-api headers could definitely declare inline on their own internal functions like jit_calculate_ndim, jit_calculate_ndim_loop, jit_calculate_vector, etc. in a similar hope that the compiler will inline them.
probably use additional compiler specific specifiers like __forceinline and __attribute__((always_inline)) as more hope and stronger hints to the compiler. But remember...it is still not guaranteed.

Do these work? Not really. The compiler failed to inline when I tried inline in the headers for jit_calculate_vector and stencil's calc_cell and get_cell.

Using __forceinline failed as it couldn't inline all the matrix calc related functions. For example:

[build] C:\repos-nobackup\min-api\include\c74_min_operator_matrix.h(474) : warning C4714: function 'void __cdecl c74::min::jit_calculate_ndim<class jit_stencil,0>(struct c74::min::minwrap<class jit_stencil,void> * __ptr64,long,long * __ptr64,long,struct c74::max::t_jit_matrix_info * __ptr64,unsigned char * __ptr64,struct c74::max::t_jit_matrix_info * __ptr64,unsigned char * __ptr64)' marked as __forceinline not inlined
[build] C:\repos-nobackup\min-api\include\c74_min_operator_matrix.h(513) : warning C4714: function 'void __cdecl c74::min::jit_calculate_ndim_single<class jit_stencil,0>(struct c74::min::minwrap<class jit_stencil,void> * __ptr64,long,long * __ptr64,long,struct c74::max::t_jit_matrix_info * __ptr64,unsigned char * __ptr64)' marked as __forceinline not inlined

The Microsoft doc https://docs.microsoft.com/en-us/cpp/cpp/inline-functions-cpp?redirectedfrom=MSDN&view=vs-2019 warns that recursive functions have limitations. Perhaps those two functions are ok to not be inlined since in the parallel ndim scenario that is the level of chunk work.

Removing __forceinline on those two but keeping the force on the others, resulted in compile that did not have entries in the MAP file for jit_calculate_vector, jit_calculate_ndim_loop, calc_cell, get_cell, jit_matrix_docalc, perhaps others.

Keep in mind that I see many class member functions that have no specifier on them and they are defined in the class definitions. This means they are implicitly inlined as if the inline specifier was on them. But still no guarantee. Therefore, compiler specific specifiers like __forceline can be used on them to increase the chances they will be inlined.

matrix_coord is such a critical class that it should be inlined. Yet, the compiler didn't do it...

 0001:00001df0       ??0matrix_coord@min@c74@@QEAA@JJ@Z 0000000180002df0 f i min.jit.stencil.cpp.obj
 0001:00001e20       ?x@matrix_coord@min@c74@@QEBAJXZ 0000000180002e20 f i min.jit.stencil.cpp.obj
 0001:00001e30       ?y@matrix_coord@min@c74@@QEBAJXZ 0000000180002e30 f i min.jit.stencil.cpp.obj

I was able to have matrix_coord inlined if I used __forceinline on the constructor, x(), and y(). Perhaps in_cell() and in_pixel() should be done the same.

Also keep in mind that a static function name will not appear in a MAP file. To see if such a function has been inlined, you must examine the assembly at the calling location.

To my knowledge, the only guarantee for code to be contiguous and no jumps is when code is fully defined all within one function. That is...a single all inclusive ...calculate_ndim()

Cycling74 / min-api