The cost of matmul4 increases by about 15% (and matmul1 by about 10%) after load vectorization is applied.
The main reason is that variables are spilled to the stack because there are not enough registers.
The temperature (temp) cost and the stack load/store costs (load_stack, store_stack) increase substantially.
Because the vectorize pass sinks all load uses, the loaded values occupy registers for a longer period.
InlinerPass should be removed, and an outliner pass is needed to improve performance.
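The register-pressure argument above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the behavior of the actual vectorize pass: the instruction tuples, the value names (a0, s0, ...) and the helper max_live are hypothetical, and a value is counted as live from its definition until its last use.

```python
# Toy model (hypothetical): each instruction is (defined_value, [used_values]).
# A value is live from the instruction that defines it until its last use.

def max_live(schedule):
    """Return the peak number of values live after any instruction."""
    # Record the index of the last instruction that uses each value.
    last_use = {}
    for i, (_, uses) in enumerate(schedule):
        for u in uses:
            last_use[u] = i

    live = set()
    peak = 0
    for i, (dst, uses) in enumerate(schedule):
        # Operands whose last use is this instruction die here.
        for u in uses:
            if last_use[u] == i:
                live.discard(u)
        # A defined value becomes live only if something uses it later.
        if dst is not None and dst in last_use:
            live.add(dst)
        peak = max(peak, len(live))
    return peak


# Each load is consumed right away by a running sum.
interleaved = [
    ("a0", []), ("s0", ["a0"]),
    ("a1", []), ("s1", ["s0", "a1"]),
    ("a2", []), ("s2", ["s1", "a2"]),
    ("a3", []), ("s3", ["s2", "a3"]),
    (None, ["s3"]),
]

# All loads are grouped first and their uses are sunk below them.
grouped = [
    ("a0", []), ("a1", []), ("a2", []), ("a3", []),
    ("s0", ["a0"]),
    ("s1", ["s0", "a1"]),
    ("s2", ["s1", "a2"]),
    ("s3", ["s2", "a3"]),
    (None, ["s3"]),
]

print("interleaved peak:", max_live(interleaved))
print("grouped peak:", max_live(grouped))
```

In this model the grouped schedule keeps more values live at once than the interleaved one. Once that peak exceeds the number of available registers, the extra values would have to be spilled to the stack.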
input8 costs
sprint 2
Returned: 0
Cost: 69154896.7062
Max heap usage (bytes): 240000
temp: 31535562.6993
addsub: 19174294.3978
muldiv: 5442088.0000
load_stack: 5343756.0000
load_heap: 3040000.0000
store_stack: 1812512.0000
logical: 1337705.6000
store_heap: 1003200.0000
comp: 168833.0000
brcond_true: 84336.0000
bruncond: 67656.0000
brcond_false: 63399.0000
read: 40002.0000
write: 30000.0000
vstore_heap: 11520.0000
malloc: 24.0000
call_arg: 4.0000
ret: 2.0000
call: 2.0000
vtemp: 0.0000
vstore_stack: 0.0000
vload_stack: 0.0000
vload_heap: 0.0000
ternary: 0.0000
switch: 0.0000
free: 0.0000
cool: 0.0000
with load vectorize
Returned: 0
Cost: 79827059.6099
Max heap usage (bytes): 240000
temp: 40608343.5996
addsub: 19120534.3978
load_stack: 6563756.0000
muldiv: 5428890.0000
load_heap: 3001600.0000
store_stack: 2315012.0000
logical: 1314185.6000
store_heap: 1003200.0000
comp: 168833.0000
brcond_true: 84336.0000
bruncond: 67656.0000
brcond_false: 63399.0000
read: 40002.0000
write: 30000.0000
vstore_heap: 11520.0000
vload_heap: 5760.0000
malloc: 24.0000
call_arg: 4.0000
ret: 2.0000
call: 2.0000
vtemp: 0.0000
vstore_stack: 0.0000
vload_stack: 0.0000
ternary: 0.0000
switch: 0.0000
free: 0.0000
cool: 0.0000
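To see which categories drive the increase, the two input8 breakdowns can be compared per category. The script below is a minimal sketch: the numbers are copied from the two dumps in this report (the lower-cost run as "before", the higher-cost run as "after"), while the helper parse_costs and the variable names are hypothetical and not part of the project's tooling.

```python
import re

# Cost dumps copied from the two input8 runs in this report.
BEFORE = """temp: 31535562.6993 addsub: 19174294.3978 muldiv: 5442088.0000
load_stack: 5343756.0000 load_heap: 3040000.0000 store_stack: 1812512.0000
logical: 1337705.6000 store_heap: 1003200.0000 comp: 168833.0000
brcond_true: 84336.0000 bruncond: 67656.0000 brcond_false: 63399.0000
read: 40002.0000 write: 30000.0000 vstore_heap: 11520.0000 malloc: 24.0000
call_arg: 4.0000 ret: 2.0000 call: 2.0000 vload_heap: 0.0000"""

AFTER = """temp: 40608343.5996 addsub: 19120534.3978 load_stack: 6563756.0000
muldiv: 5428890.0000 load_heap: 3001600.0000 store_stack: 2315012.0000
logical: 1314185.6000 store_heap: 1003200.0000 comp: 168833.0000
brcond_true: 84336.0000 bruncond: 67656.0000 brcond_false: 63399.0000
read: 40002.0000 write: 30000.0000 vstore_heap: 11520.0000
vload_heap: 5760.0000 malloc: 24.0000 call_arg: 4.0000 ret: 2.0000
call: 2.0000"""

BEFORE_TOTAL = 69154896.7062
AFTER_TOTAL = 79827059.6099


def parse_costs(text):
    """Parse 'name: value' pairs from a cost dump into a dict of floats."""
    return {name: float(value)
            for name, value in re.findall(r"(\w+): ([\d.]+)", text)}


before = parse_costs(BEFORE)
after = parse_costs(AFTER)

# Per-category change; a category missing on one side counts as 0.
delta = {name: after.get(name, 0.0) - before.get(name, 0.0)
         for name in sorted(set(before) | set(after))}

increase = AFTER_TOTAL / BEFORE_TOTAL - 1
print(f"total increase: {increase:.1%}")

# Print the categories that changed, largest absolute change first.
for name, d in sorted(delta.items(), key=lambda kv: -abs(kv[1])):
    if d != 0.0:
        print(f"{name:12s} {d:+16.4f}")
```

On these figures the increases are concentrated in temp, load_stack and store_stack, while addsub, load_heap, logical and muldiv decrease slightly.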