flame / blis

BLAS-like Library Instantiation Software Framework

Review obj_t-related stack consumption #577

Open hominhquan opened 2 years ago

hominhquan commented 2 years ago

BLIS's internal layers re-clone and re-alias the obj_t operands a, b, and c at almost every level (bli_?_front, bli_l3_thread_entry, bli_gemm_int, and the bli_?_blk_var? variants). This adds management overhead for the aliasing itself and consumes a lot of stack, which can be a problem on memory-constrained platforms.

Could we look at whether some of the cloning logic can be relaxed? Cloning is required to isolate threads from one another, but once a thread is executing on its own, it could clone only where the algorithm requires it.

fgvanzee commented 2 years ago

Some of this will naturally be addressed when @devinamatthews obviates the need for the bli_gemm_int() function, which is on his docket. But yes, we do a lot of aliasing under the assumption that it's cheap.

We could probably get by with aliasing each matrix obj_t only once, near the very top of the call stack.
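To make the stack cost concrete, here is a minimal sketch comparing per-layer aliasing with aliasing once at the top. It is a model under stated assumptions, not BLIS code: `toy_obj_t` is a hypothetical stand-in whose fields and size do not match the real obj_t, and the function names are invented for illustration. It counts only the bytes taken by the aliases themselves; real frames also hold other locals, and a compiler may lay them out differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a matrix descriptor; NOT the real BLIS obj_t. */
typedef struct {
    void     *buffer;
    size_t    dim[2];
    size_t    off[2];
    ptrdiff_t stride[2];
    unsigned  info;
} toy_obj_t;

/* Pattern 1: every layer makes its own local copies of a, b, c.
 * Returns the bytes of alias storage live when the deepest layer runs. */
static size_t alias_every_layer(const toy_obj_t *a, const toy_obj_t *b,
                                const toy_obj_t *c, int layers)
{
    assert(layers >= 1);
    toy_obj_t a_l = *a, b_l = *b, c_l = *c;   /* three aliases per layer */
    size_t here = sizeof a_l + sizeof b_l + sizeof c_l;
    if (layers == 1) return here;
    return here + alias_every_layer(&a_l, &b_l, &c_l, layers - 1);
}

/* Lower layers that only forward the pointers they were given. */
static size_t forward_only(const toy_obj_t *a, const toy_obj_t *b,
                           const toy_obj_t *c, int layers)
{
    if (layers <= 0) return 0;                /* no alias storage added */
    return forward_only(a, b, c, layers - 1);
}

/* Pattern 2: alias once in the top layer, then pass pointers down. */
static size_t alias_once_at_top(const toy_obj_t *a, const toy_obj_t *b,
                                const toy_obj_t *c, int layers)
{
    assert(layers >= 1);
    toy_obj_t a_l = *a, b_l = *b, c_l = *c;   /* the only aliases made */
    return sizeof a_l + sizeof b_l + sizeof c_l
         + forward_only(&a_l, &b_l, &c_l, layers - 1);
}
```

In this model, calling `alias_every_layer(&a, &b, &c, 4)` reports `4 * 3 * sizeof(toy_obj_t)` bytes of aliases, while `alias_once_at_top` with the same depth reports `3 * sizeof(toy_obj_t)`, so the alias storage grows linearly with call depth in the first pattern and stays constant in the second.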

devinamatthews commented 2 years ago

Not if you want to be able to use task-based parallelism... However, only aliasing two of the three matrices in each gemm variant is sufficient. This is maybe 30-40% of the current number of aliases?
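One possible reading of the "two of the three" point: a blocked variant that partitions a single dimension only changes the views of the two operands that share that dimension, so the third can be forwarded by pointer. The sketch below illustrates this for a partition along m, where A and C are sub-viewed and B is passed through untouched. `view_t`, `toy_gemm_leaf`, and `toy_gemm_var_m` are hypothetical names for this illustration and are not BLIS types or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strided matrix view; a stand-in, NOT the BLIS obj_t. */
typedef struct {
    double *buf;     /* first element of the view */
    size_t  m, n;    /* rows, columns */
    size_t  rs, cs;  /* row stride, column stride */
} view_t;

/* Leaf computation C += A * B, reading all three views through pointers. */
static void toy_gemm_leaf(const view_t *a, const view_t *b, const view_t *c)
{
    for (size_t i = 0; i < c->m; ++i)
        for (size_t j = 0; j < c->n; ++j)
            for (size_t p = 0; p < a->n; ++p)
                c->buf[i * c->rs + j * c->cs] +=
                    a->buf[i * a->rs + p * a->cs] *
                    b->buf[p * b->rs + j * b->cs];
}

/* Blocked variant partitioning the m dimension in steps of mb.
 * Only A and C have an m dimension, so only they get local aliases. */
static void toy_gemm_var_m(const view_t *a, const view_t *b,
                           const view_t *c, size_t mb)
{
    assert(mb > 0 && a->m == c->m && a->n == b->m && b->n == c->n);
    for (size_t i = 0; i < c->m; i += mb) {
        size_t rows = (c->m - i < mb) ? c->m - i : mb;
        view_t a1 = *a;                 /* alias 1: row block of A */
        view_t c1 = *c;                 /* alias 2: row block of C */
        a1.buf += i * a->rs;  a1.m = rows;
        c1.buf += i * c->rs;  c1.m = rows;
        toy_gemm_leaf(&a1, b, &c1);     /* B forwarded, never copied */
    }
}
```

By the same reasoning, a variant partitioning n would alias only B and C, and one partitioning k would alias only A and B. Whether this matches the change intended in the comment above is an assumption.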

hominhquan commented 2 years ago

> Some of this will naturally be addressed when @devinamatthews obviates the need for the bli_gemm_int() function, which is on his docket.

+1 @devinamatthews. As far as I can see, there is also some aliasing in bli_?_front, bli_l3_thread_entry, and bli_?_blk_var?.