[Feature] Rewind memory access loops

Issue Currently, the automatically generated loops that access top-level variables are pipelined at II=1, which is great. However, in my testing these loops can still take 10x the theoretical runtime for 2D arrays.

Solution Adding rewind to the end of the automatically generated pipeline pragma fully solves this performance issue and sometimes also reduces hardware usage.
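For illustration, here is a minimal sketch of what the requested change might look like in the generated code, assuming the backend emits Vitis HLS C++. The function name `load_buf`, the array sizes, and the loop shape are hypothetical and not taken from the actual generated output; only the `rewind` keyword on the pipeline pragma is the point. A regular C++ compiler ignores the pragma, so this only shows where the keyword would go.

```cpp
#include <cstdint>

// Hypothetical sizes, chosen only for this sketch.
constexpr int ROWS = 4;
constexpr int COLS = 4;

// Copies a top-level 2D array into a local buffer, similar in spirit to
// the automatically generated memory access loops described above.
void load_buf(const int32_t src[ROWS][COLS], int32_t buf[ROWS][COLS]) {
  for (int i = 0; i < ROWS; ++i) {
    for (int j = 0; j < COLS; ++j) {
// Current output (assumed):  #pragma HLS pipeline II=1
// Requested output: the same pragma with "rewind" appended.
#pragma HLS pipeline II=1 rewind
      buf[i][j] = src[i][j];
    }
  }
}
```

Whether the tool can actually apply rewind to a given loop nest depends on the HLS tool's own rules for that loop, so the measured benefit may vary by design.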

Example - My matrix-vector multiply program. Without rewind (current setup): 78-cycle interval for buf1.

With rewind manually added: 4-cycle interval achieved.
