jitter matrix processing direction has large cpu/memory cache contention and overhead

The min matrix operator reads the calculation direction variable 2,075,759 more times than necessary on each matrix calculation of an HD frame. It only needs to read it once per matrix calculation.

When the matrix is processed a call is made to the subclass calc_cell() method for each cell. The order in which the cells are iterated will be one of the options provided here [by m_direction]

For example at: https://github.com/Cycling74/min-api/blob/32449860b7929bd822cf543c551a9f5fb9fad6e5/include/c74_min_operator_matrix.h#L283-L291

This variable and its access are not thread-safe.
If it is changed in the middle of an ndim parallelized action, the effects are indeterminate.
To get the direction value, the CPU must dereference a pointer and read the variable up to (2 + frame width) times for every section (up to every row) of every ndim parallelized block, for every matrix frame.

As an example, with the current min-api codebase, a single frame of an HD color RGBA image can access m_direction the following number of times:

(2 + frame width = 1922 potential accesses in each jit_calculate_vector section) x (1080 sections, one for every HD row) = 2,075,760 times

That's very poor. 😞 It's a single value that only needs to be read once from the class variable on each frame.
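The count above can be reproduced with a small model. This is only a sketch of the access pattern as described in this issue, not the min-api code itself: `counting_operator`, `get_direction()` and both `reads_when_...` functions are hypothetical names, and the placement of the two per-section reads is an assumption taken from the "2 + frame width" figure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the operator's direction options.
enum class direction { forward, reverse, bidirectional };

// Hypothetical operator that counts every read of its direction member.
class counting_operator {
public:
    direction get_direction() const {
        ++m_reads;  // instrumentation only, not part of any real API
        return m_direction;
    }
    std::uint64_t reads() const { return m_reads; }

private:
    direction             m_direction { direction::forward };
    mutable std::uint64_t m_reads { 0 };
};

// Models the pattern described above: every row (section) reads the member
// twice up front (assumed) and then once more for each cell in the row.
std::uint64_t reads_when_checked_per_cell(std::size_t width, std::size_t height) {
    counting_operator op;
    for (std::size_t row = 0; row < height; ++row) {
        (void)op.get_direction();
        (void)op.get_direction();
        for (std::size_t col = 0; col < width; ++col)
            (void)op.get_direction();
    }
    return op.reads();
}

// Reads the member once per frame and reuses the local copy for every cell.
std::uint64_t reads_when_hoisted(std::size_t width, std::size_t height) {
    counting_operator op;
    const direction dir = op.get_direction();  // the only read of the member
    std::size_t visited = 0;
    for (std::size_t row = 0; row < height; ++row)
        for (std::size_t col = 0; col < width; ++col)
            if (dir == direction::forward)
                ++visited;
    assert(visited == width * height);
    return op.reads();
}
```

For a 1920 x 1080 frame the first function reports 2,075,760 reads and the second reports 1, which is where the 2,075,759 excess reads come from.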

Worse, access to this class member variable is spread across multiple CPU caches. Because it is a read/write variable reached through a pointer, and it could be changed by any thread running on any core, the caches are constantly thrashing, which makes access slow.

There are multiple possible improvements, and they can be combined.

Make direction compile-time. Only a subset of externals needs to process the cells of a matrix in a calculation order that changes dynamically at runtime.
Make direction set on the class. For the subset of externals that need a dynamically changing calculation order, and the subset of customers of those externals who choose a non-standard direction, let them set it as an argument on the Max object or as a read-only attribute. It is then a const set at class construction.
Copy direction from the class's member variable into a const local parameter/struct for each ndim section/thread. This means the direction cannot change during the scope of a single matrix_calc. And because the value is copied into a param/struct for each ndim section, it has cache locality on the CPU that is running that ndim thread.
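The three options above could look roughly like the following. This is a minimal sketch under stated assumptions, not a patch against min-api: every name here (`visit_row_static`, `fixed_direction_operator`, `section_params`, `visit_row`, `calc_matrix`) is hypothetical, the loop is single-threaded for brevity, and recording the visited column indices stands in for the real per-cell work.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical direction options; not the min-api enum.
enum class direction { forward, reverse };

// Option 1: direction fixed at compile time through a template parameter.
// The condition is a compile-time constant, so the compiler can drop the
// unused branch and no member is read at runtime.
template <direction Dir>
void visit_row_static(std::size_t width, std::vector<std::size_t>& order) {
    if (Dir == direction::forward) {
        for (std::size_t x = 0; x < width; ++x) order.push_back(x);
    } else {
        for (std::size_t x = width; x-- > 0;) order.push_back(x);
    }
}

// Option 2: direction chosen once at construction and never written again.
class fixed_direction_operator {
public:
    explicit fixed_direction_operator(direction d) : m_direction { d } {}
    direction get_direction() const { return m_direction; }

private:
    const direction m_direction;
};

// Option 3: per-section parameters carried by value, so each section works
// on its own copy instead of dereferencing the shared object.
struct section_params {
    direction   dir;
    std::size_t width;
};

void visit_row(const section_params params, std::vector<std::size_t>& order) {
    if (params.dir == direction::forward) {
        for (std::size_t x = 0; x < params.width; ++x) order.push_back(x);
    } else {
        for (std::size_t x = params.width; x-- > 0;) order.push_back(x);
    }
}

// The member is read once per matrix calculation; every section then
// receives a copy of the snapshot.
std::vector<std::size_t> calc_matrix(const fixed_direction_operator& op,
                                     std::size_t width, std::size_t height) {
    const section_params params { op.get_direction(), width };
    std::vector<std::size_t> order;
    order.reserve(width * height);
    for (std::size_t row = 0; row < height; ++row)
        visit_row(params, order);  // a threaded version would pass one copy per worker
    assert(order.size() == width * height);
    return order;
}
```

Options 2 and 3 combine naturally: the const member removes writes after construction, and the by-value snapshot keeps each section's read local to the thread that runs it.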
