NVlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
https://timeloop.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License

Memory Temporal reduction #42

Closed egiacomin closed 4 years ago

egiacomin commented 4 years ago

Hello,

I have a quick question about the memory Temporal reduction: When looking at a basic example where we have 1MAC + 1 SRAM array, and a 1D convolution where R=3 and P=16 (https://github.com/Accelergy-Project/timeloop-accelergy-exercises/tree/master/exercises/timeloop/00-model-conv1d-1level/ref-output), how is the "Temporal reductions (per-instance)" calculated?

For instance, in (https://github.com/Accelergy-Project/timeloop-accelergy-exercises/blob/master/exercises/timeloop/00-model-conv1d-1level/ref-output/timeloop-model.stats.txt), line 100, there are 32 temporal reductions, so these memory accesses are not counted in the final number of memory accesses. Since P=16, we will have 16*3 = 48 output memory writes (32 partial sums and 16 ofmaps) to the SRAM and 16*2 = 32 memory reads (for the partial sums) from the SRAM. Shouldn't the final number of memory accesses be 48+32 here? Or is there some kind of optimization I am missing?

Thanks!

angshuman-parashar commented 4 years ago

Right, so the 48 output memory writes are shown on line 98 as "scalar updates" (we use the buffets paradigm, in which memory operations are classified into fills, reads and updates). The 32 reads are shown on line 97 as "scalar reads". This number is 32 because the first update to each output (on iteration R=0 of [0,1,2]) does not need a Psum to be read from the SRAM. Therefore, there's no temporal reduction to be carried out on that iteration either, giving us 32 temporal reductions. You are correct that the final number of memory accesses is updates + reads + fills = 48 + 32 + 0. So I don't see any inconsistency between your mental model and the timeloop output stats.
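To make the counting concrete, here is a minimal Python sketch of the arithmetic described above. It is not Timeloop code: the function name `count_output_accesses` and the loop order are assumptions made for illustration, and it only models the output tensor for a single MAC with one SRAM level. Each MAC result is one update; a partial-sum read and a temporal reduction happen only when the output has already been written once.

```python
def count_output_accesses(R, P):
    """Count output-tensor SRAM accesses for a 1D convolution on one MAC.

    Hypothetical helper for illustration only; it mirrors the reasoning in
    this thread, not Timeloop's internal implementation.
    """
    written = set()  # outputs that already hold a partial sum in the SRAM
    updates = reads = temporal_reductions = 0
    for p in range(P):          # output positions
        for r in range(R):      # filter taps
            if p in written:
                # An earlier partial sum exists: read it and accumulate.
                reads += 1
                temporal_reductions += 1
            written.add(p)
            updates += 1        # every MAC result is written back
    return {
        "updates": updates,
        "reads": reads,
        "temporal_reductions": temporal_reductions,
        "fills": 0,             # outputs are never filled from a parent level here
    }


counts = count_output_accesses(R=3, P=16)
total = counts["updates"] + counts["reads"] + counts["fills"]
print(counts, "total accesses:", total)
```

With R=3 and P=16 this gives 48 updates, 32 reads and 32 temporal reductions, for 80 accesses in total, matching the numbers quoted from the stats file.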

egiacomin commented 4 years ago

Thanks for your reply. It looks like the total number of accesses = Address generations + Temporal reductions, which makes sense when looking at the total Energy (per-instance).