The most important part of a jitter external is the matrix calculation loop. Full stop. The min-api is not designed to be efficient in this most important area. Critical matrix calculation functions are not inlined; instead, every frame's matrix calculation makes millions of expensive function calls (for an HD frame).
Writing C/C++ code for jitter objects is a technical endeavor. Otherwise, a person should use (gen). I request the Cycling74 team do a thorough design review of min-api to surface these performance flaws.
Context
The matrix calc function in a jitter external is called once per frame. That function loops over all dimensions of input and output matrices to transform (or generate) the output matrices. It is common for such looping to be sliced into chunks (aka "ndim") and to spawn threads for each ndim chunk. And then each thread loops over its chunk.
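To make the chunking concrete, here is a minimal sketch of slicing a frame into row chunks and running one thread per chunk. Everything in it is an assumption for illustration: the `plane` struct, `process_rows`, and `process_frame` are hypothetical names, the per-cell work is a simple inversion, and `std::thread` stands in for whatever parallel mechanism Jitter actually uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a one-plane char matrix; not a min-api or Jitter type.
struct plane {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> cells;  // row-major, width * height entries
};

// Per-chunk loop that one thread runs over rows [row_begin, row_end).
// The per-cell work (inverting the value) is written directly in the loop body.
void process_rows(const plane& in, plane& out, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t y = row_begin; y < row_end; ++y) {
        const std::uint8_t* src = in.cells.data() + y * in.width;
        std::uint8_t*       dst = out.cells.data() + y * out.width;
        for (std::size_t x = 0; x < in.width; ++x)
            dst[x] = static_cast<std::uint8_t>(255 - src[x]);
    }
}

// Slice the frame into contiguous row chunks and run one thread per chunk.
void process_frame(const plane& in, plane& out, std::size_t thread_count) {
    assert(thread_count > 0);
    assert(in.width == out.width && in.height == out.height);
    const std::size_t rows_per_chunk = (in.height + thread_count - 1) / thread_count;
    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < in.height; begin += rows_per_chunk) {
        const std::size_t end = std::min(begin + rows_per_chunk, in.height);
        workers.emplace_back(process_rows, std::cref(in), std::ref(out), begin, end);
    }
    for (auto& w : workers)
        w.join();
}
```

Each thread only writes its own rows of `out`, so no locking is needed. The inner loop is where any per-cell function call would be paid once per cell.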
Naturally, there is great need for the loops to be optimized, access to memory kept thread local in all cases possible, and for code to be local to the thread and not jump to/from other places in memory.
Modern compilers statically analyze code and functions to optimize the output executable. One optimization is to reduce code duplication by emitting each function as an independent unit, and then each calling site calls/jumps to that shared function. Another optimization is to "inline" a function at the calling site. Inlining improves speed at the cost of larger code/memory size. Why consider whether or not to inline a function? Because jumps/calls are very expensive when they happen once per cell. These are just two of many optimizations compilers do, and the approach a compiler takes is based on its static analysis. This analysis is specific to the function and the calling site. Meaning, it is possible for the compiler to inline a function in one calling location and not inline that same function in another calling location.
There is a rumor that compilers tend not to inline functions longer than a few lines: that the analyzer's cost/benefit analysis determines that the added code size outweighs an unknown speed improvement. Since the analysis is static, the compiler cannot verify this...it's just an algorithmic guess. I can find nothing in the C++ specs to support this; however, there is a general "feeling" in the C++ community that for functions to be successfully inlined, they need to be very short and focused.
Setup
VS Community 2019 v16.6.0
max api commit 1c06b88fc30da0931b6dfa4a874bc081afe00926
min api commit 210d5da4b9a3ebc201b4be5c1f0733e3110b1993
Edit your harness's CMakeLists.txt to add the /MAP parameter to target_link_options().
Build stencil with cmake build variant Release
Examine the generated min.jit.stencil.map
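Examining the MAP file by eye works, but the check can also be scripted. Below is a rough sketch that counts how many lines of a MAP listing mention a given name. The function names `count_map_entries` and `count_entries_in_text` are hypothetical helpers, not part of any toolchain, and the match is a plain substring search, so a short name can also match inside an unrelated symbol.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Count the lines of a linker MAP listing that mention `name`.
// A non-zero count means the linker emitted at least one standalone copy
// of a function whose decorated name contains `name`.
std::size_t count_map_entries(std::istream& map_text, const std::string& name) {
    assert(!name.empty());
    std::size_t count = 0;
    std::string line;
    while (std::getline(map_text, line)) {
        if (line.find(name) != std::string::npos)
            ++count;
    }
    return count;
}

// Convenience wrapper for MAP text that is already held in memory.
std::size_t count_entries_in_text(const std::string& map_text, const std::string& name) {
    std::istringstream stream(map_text);
    return count_map_entries(stream, name);
}
```

For a file on disk, open it with `std::ifstream` and pass the stream to `count_map_entries`. A count of zero is only a hint that every call was inlined, since static functions never appear in a MAP file at all.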
Result
Core functions of the matrix calculation are not inlined. Instead, they are independent functions with multiple call sites jumping/calling into them. This is terribly inefficient and detrimental to a performant jitter external. Here is proof for two functions, jit_calculate_vector and calc_cell:
0001:00013d90 ??$jit_calculate_vector@Vjit_stencil@@E$0A@@min@c74@@YAXPEAU?$minwrap@Vjit_stencil@@X@01@AEBVmatrix_info@01@JJPEAUt_jit_op_info@max@1@2@Z 0000000180014d90 f i min.jit.stencil.cpp.obj
0001:00014bf0 ??$jit_calculate_vector@Vjit_stencil@@H$0A@@min@c74@@YAXPEAU?$minwrap@Vjit_stencil@@X@01@AEBVmatrix_info@01@JJPEAUt_jit_op_info@max@1@2@Z 0000000180015bf0 f i min.jit.stencil.cpp.obj
0001:000160c0 ??$jit_calculate_vector@Vjit_stencil@@M$0A@@min@c74@@YAXPEAU?$minwrap@Vjit_stencil@@X@01@AEBVmatrix_info@01@JJPEAUt_jit_op_info@max@1@2@Z 00000001800170c0 f i min.jit.stencil.cpp.obj
0001:000177f0 ??$jit_calculate_vector@Vjit_stencil@@N$0A@@min@c74@@YAXPEAU?$minwrap@Vjit_stencil@@X@01@AEBVmatrix_info@01@JJPEAUt_jit_op_info@max@1@2@Z 00000001800187f0 f i min.jit.stencil.cpp.obj
0001:00018da0 ??$calc_cell@E$00@jit_stencil@@QEAA?AV?$array@E$00@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 0000000180019da0 f i min.jit.stencil.cpp.obj
0001:00018ec0 ??$calc_cell@E$03@jit_stencil@@QEAA?AV?$array@E$03@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 0000000180019ec0 f i min.jit.stencil.cpp.obj
0001:00018fe0 ??$calc_cell@H$00@jit_stencil@@QEAA?AV?$array@H$00@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 0000000180019fe0 f i min.jit.stencil.cpp.obj
0001:000190e0 ??$calc_cell@M$00@jit_stencil@@QEAA?AV?$array@M$00@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 000000018001a0e0 f i min.jit.stencil.cpp.obj
0001:00019200 ??$calc_cell@N$00@jit_stencil@@QEAA?AV?$array@N$00@std@@V12@AEBVmatrix_info@min@c74@@AEAVmatrix_coord@45@@Z 000000018001a200 f i min.jit.stencil.cpp.obj
Here are two of those function names undecorated by undname.exe
Expected
The entire ndim section of code for a matrix calculation should be contiguous, with no jumps/calls to functions, and should use thread-local storage whenever technically possible.
Workarounds
As min-api is currently designed, there is no quick fix. The segmentation of jitter functionality has led to a cascade of functions calling to/from each other, which results in significant overhead.
It may be possible (no guarantee) to request the compiler inline functions. For example:
min-api headers could declare inline on jitter member function prototypes like calc_cell so that when consumers of min-api write their own calc_cell they are forced to also inline. But remember...the compiler doesn't guarantee to actually inline it.
min-api headers could definitely declare inline on their own internal functions like jit_calculate_ndim, jit_calculate_ndim_loop, jit_calculate_vector, etc. in a similar hope that the compiler will inline them.
min-api headers could probably also use compiler-specific specifiers such as __forceinline (MSVC) and __attribute__((always_inline)) (GCC/Clang) as stronger hints to the compiler. But remember...it is still not guaranteed.
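As a sketch of what the suggestions above could look like, here is one common way to wrap the compiler-specific specifiers behind a single macro. The macro name `EXAMPLE_FORCE_INLINE` and the functions `scale_cell` and `scale_row` are hypothetical and not taken from min-api. This only shows the mechanics; it does not claim the compiler will honor the request.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical macro: MSVC and GCC/Clang spell the "please inline" request differently.
#if defined(_MSC_VER)
    #define EXAMPLE_FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
    #define EXAMPLE_FORCE_INLINE inline __attribute__((always_inline))
#else
    #define EXAMPLE_FORCE_INLINE inline
#endif

// Per-cell work, small enough to be a reasonable inlining candidate.
EXAMPLE_FORCE_INLINE int scale_cell(int value, int gain) {
    return value * gain;
}

// The hot loop. If the compiler honors the request, the loop body
// contains the multiplication itself rather than a call to scale_cell.
void scale_row(const int* in, int* out, std::size_t count, int gain) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < count; ++i)
        out[i] = scale_cell(in[i], gain);
}
```

On MSVC, warning C4714 reports any function marked __forceinline that was not inlined, which makes failures visible at build time.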
Do these work? Not really. The compiler failed to inline when I tried inline in the headers for jit_calculate_vector and stencil's calc_cell and get_cell.
Using __forceinline failed as it couldn't inline all the matrix calc related functions. For example:
[build] C:\repos-nobackup\min-api\include\c74_min_operator_matrix.h(474) : warning C4714: function 'void __cdecl c74::min::jit_calculate_ndim<class jit_stencil,0>(struct c74::min::minwrap<class jit_stencil,void> * __ptr64,long,long * __ptr64,long,struct c74::max::t_jit_matrix_info * __ptr64,unsigned char * __ptr64,struct c74::max::t_jit_matrix_info * __ptr64,unsigned char * __ptr64)' marked as __forceinline not inlined
[build] C:\repos-nobackup\min-api\include\c74_min_operator_matrix.h(513) : warning C4714: function 'void __cdecl c74::min::jit_calculate_ndim_single<class jit_stencil,0>(struct c74::min::minwrap<class jit_stencil,void> * __ptr64,long,long * __ptr64,long,struct c74::max::t_jit_matrix_info * __ptr64,unsigned char * __ptr64)' marked as __forceinline not inlined
Removing __forceinline from those two but keeping it on the others resulted in a build whose MAP file had no entries for jit_calculate_vector, jit_calculate_ndim_loop, calc_cell, get_cell, jit_matrix_docalc, and perhaps others.
Keep in mind that I see many class member functions that have no specifier on them and are defined inside the class definition. This means they are implicitly inline, as if the inline specifier were on them. But still no guarantee. Therefore, compiler-specific specifiers like __forceinline can be used on them to increase the chances they will be inlined.
matrix_coord is such a critical class that it should be inlined. Yet, the compiler didn't do it...
0001:00001df0 ??0matrix_coord@min@c74@@QEAA@JJ@Z 0000000180002df0 f i min.jit.stencil.cpp.obj
0001:00001e20 ?x@matrix_coord@min@c74@@QEBAJXZ 0000000180002e20 f i min.jit.stencil.cpp.obj
0001:00001e30 ?y@matrix_coord@min@c74@@QEBAJXZ 0000000180002e30 f i min.jit.stencil.cpp.obj
I was able to have matrix_coord inlined if I used __forceinline on the constructor, x(), and y(). Perhaps in_cell() and in_pixel() should be done the same.
Also keep in mind that a static function name will not appear in a MAP file. To see if such a function has been inlined, you must examine the assembly at the calling location.
To my knowledge, the only guarantee for code to be contiguous and no jumps is when code is fully defined all within one function. That is...a single all inclusive ...calculate_ndim()
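A rough sketch of what such an all-inclusive function could look like for a 2D char matrix follows. The name `calculate_2d_char` and its parameters are hypothetical, and the layout is an assumption for illustration: planes interleaved per cell and rows `row_stride` bytes apart. The per-cell work is a simple inversion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical single-function calculation: every loop level and the
// per-cell work live in this one function body, with no helper calls.
void calculate_2d_char(const std::uint8_t* in, std::uint8_t* out,
                       std::size_t width, std::size_t height,
                       std::size_t planecount, std::size_t row_stride) {
    assert(row_stride >= width * planecount);
    for (std::size_t y = 0; y < height; ++y) {
        const std::uint8_t* src = in + y * row_stride;   // start of input row y
        std::uint8_t*       dst = out + y * row_stride;  // start of output row y
        for (std::size_t x = 0; x < width; ++x) {
            for (std::size_t p = 0; p < planecount; ++p) {
                const std::size_t i = x * planecount + p;  // plane p of cell x
                dst[i] = static_cast<std::uint8_t>(255 - src[i]);  // per-cell work, written in place
            }
        }
    }
}
```

The trade-off is that the per-type and per-plane-count variants the templates generate would have to be written or generated some other way.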
C/C++ coders can give a "hint" to the compiler to suggest the compiler inline or not inline a function. This is only a hint. Nothing guaranteed and it can't be forced. This is by design. https://docs.microsoft.com/en-us/cpp/cpp/inline-functions-cpp?redirectedfrom=MSDN&view=vs-2019 and https://en.cppreference.com/w/cpp/language/inline
The external examined here is min.jit.stencil.cpp from the min-devkit.
For the reader who doesn't have experience with MAP files, read:
https://www.codeproject.com/Articles/3472/Finding-Crash-Information-Using-the-MAP-File
https://docs.microsoft.com/en-us/cpp/build/reference/decorated-names?view=vs-2019
https://flylib.com/books/en/4.441.1.87/1/
https://stackoverflow.com/questions/1902976/msvc-any-way-to-check-if-function-is-actually-inlined
Regarding the two C4714 warnings above: the Microsoft doc https://docs.microsoft.com/en-us/cpp/cpp/inline-functions-cpp?redirectedfrom=MSDN&view=vs-2019 warns that recursive functions have limitations. Perhaps it is acceptable for those two functions (jit_calculate_ndim and jit_calculate_ndim_single) not to be inlined, since in the parallel ndim scenario that is the level of chunk work.