This PR will introduce:
[ ] compiler pass for tensor transposition and packing (necessary for generating a high-performance SA)
[ ] user API to offload an HCL stage to an SA (i.e., .systolic())
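To make "tensor transposition and packing" concrete, here is a minimal pure-Python sketch of the two layout transformations on a small matrix. It is only an illustration of the data layout, not the compiler pass itself: the helper names `transpose` and `pack_rows` and the packing factor of 2 are assumptions made for this example.

```python
def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def pack_rows(matrix, factor):
    """Group every `factor` consecutive elements of each row into one tuple.

    Each tuple stands in for one wide word that could be moved in a single
    transfer. `factor` is a hypothetical packing factor.
    """
    packed = []
    for row in matrix:
        if len(row) % factor != 0:
            raise ValueError("row length must be a multiple of the packing factor")
        packed.append(
            [tuple(row[i:i + factor]) for i in range(0, len(row), factor)]
        )
    return packed


# B is a K x N operand (K = 4, N = 2) of a matrix multiplication.
B = [[1, 2], [3, 4], [5, 6], [7, 8]]

# Transposing makes the reduction dimension K the innermost (contiguous) one.
B_t = transpose(B)

# Packing then groups neighbouring K elements into wide words.
B_packed = pack_rows(B_t, 2)
```

After these two steps, elements that are consumed together along the reduction dimension sit next to each other, which is the property the pass is meant to establish for the generated SA.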
In this PR, we will not introduce the advanced feature of using .parallel and .to to generate an SA with a custom topology. Instead, we only perform a simple analysis of the stage offloaded to AutoSA and choose the default space-mapping scheme and configurations, including SIMD.
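As a rough illustration of what such a simple analysis could look like, the sketch below picks default space loops and a SIMD factor from a loop nest description. Everything here is hypothetical: the `Loop` record, the `choose_default_config` helper, the rule of mapping the first two non-reduction loops to space, and the `max_simd` cap are assumptions for this example, not the logic implemented in this PR or in AutoSA.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    name: str
    extent: int
    is_reduction: bool


def choose_default_config(loops, max_simd=8):
    """Pick a default mapping for a loop nest (hypothetical heuristic).

    Space loops: the first two non-reduction loops, giving a 2D array.
    SIMD factor: the largest power of two, capped at `max_simd`, that
    divides the extent of the innermost reduction loop.
    """
    space = [loop.name for loop in loops if not loop.is_reduction][:2]
    reductions = [loop for loop in loops if loop.is_reduction]

    simd = 1
    if reductions:
        extent = reductions[-1].extent
        while simd * 2 <= max_simd and extent % (simd * 2) == 0:
            simd *= 2

    return {"space_loops": space, "simd": simd}


# A GEMM-like nest: i and j are parallel loops, k is the reduction loop.
gemm = [Loop("i", 64, False), Loop("j", 64, False), Loop("k", 12, True)]
config = choose_default_config(gemm)
```

For the nest above, the heuristic maps `i` and `j` to space and selects a SIMD factor of 4, since 4 divides the reduction extent 12 but 8 does not.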