KULeuven-MICAS / zigzag

HW Architecture-Mapping Design Space Exploration Framework for Deep Learning Accelerators
https://kuleuven-micas.github.io/zigzag/
MIT License

Edge TPU example #26

Closed. chensy7 closed this issue 8 months ago

chensy7 commented 8 months ago

Hello Zigzag team,

I'm going through the Edge TPU example and trying to understand the details of the hardware-arch/mappings in Zigzag. I have a couple of questions on that.

I'm using this mapping found by Zigzag for LayerNode 13 of ResNet-18: Edge_TPU-resnet18-layer_LayerNode_13_complete.json. From it, I wrote the following nested loops based on my understanding of the loop ordering/blocking found by Zigzag. The _s loops correspond to the spatial mapping and would be unrolled in an HLS setting. I also annotated each memory level next to the highest-level loop it serves.

image

Based on the hardware architecture description, this is what I think it should look like:

image

(1) Is this understanding of the Zigzag outputs correct?

(2) In the .json file, the temporal mapping for each operand has N values (each value contains some loops and their sizes). N seems to correspond to the number of memory hierarchy levels for that operand. Is this correct? If so, why does the spatial mapping have N+1 values?

(3) Based on what's generated, I understand that the RF_1B for weights is essentially a register holding the single weight that is spatially reused by each 4x4 MAC array unit. However, I don't understand the reuse pattern of the RF_2B for outputs. It seems to me that C (input channel) is unrolled so that outputs can be accumulated across input channels and stored in RF_2B, but RF_2B is only large enough to store 1 output, so it has to be constantly updated. Am I missing something?

(4) In the Edge TPU example, lines 29-41, RF_2B is defined with size 16, two read ports and two write ports. Could you explain what this is trying to describe? It seems to me that this is just a 16b register (in FFs). Why would it have 2 read ports and 2 write ports?

Sorry for the long post and thanks in advance for your time!

asyms commented 8 months ago

Hi, thanks for your questions and apologies for the delay in response.

  1. Yes, your understanding is correct.
  2. N is indeed the number of levels in the memory hierarchy for that operand. Spatially there is an additional bottom-most level which represents the unrolling of the MAC itself.
  3. Your drawing is correct for the W RF, but it is incorrect for O. Let me call the RF for output RF_O. Your drawing represents served_dimensions = {(0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)} for RF_O, as one RF_O is serving all MACs across D2, D3 and D4. In the actual memory hierarchy definition, a single RF_2B only serves D2, so there will be D1 × D3 × D4 instances across the entire hierarchy. This is difficult to visualize, as there are 4 dimensions, which you have flattened to 2 in your drawing. Depending on the temporal mapping, it is possible that the same output can be reused.
  4. I think you're correct that RF_O doesn't need 2 read and 2 write ports. @LY-Mei any thoughts on this?
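The instance counts in point 3 can be checked with a small sketch. It assumes, as described above, that a memory is replicated along every dimension it does not serve. `count_instances` is a hypothetical helper, not part of the ZigZag API, and the dimension sizes (D1 = 8, D2 = 8, D3 = 4, D4 = 4) are assumed purely for illustration.

```python
from math import prod

def count_instances(dim_sizes, served_dimensions):
    """Count memory instances: one per coordinate along each non-served dimension.

    dim_sizes: sizes of the spatial dimensions, e.g. (D1, D2, D3, D4).
    served_dimensions: set of one-hot tuples marking the dimensions that a
    single memory instance serves.
    """
    served = [False] * len(dim_sizes)
    for vec in served_dimensions:
        for i, flag in enumerate(vec):
            if flag:
                served[i] = True
    # The memory is replicated along every dimension it does not serve.
    return prod(size for size, s in zip(dim_sizes, served) if not s)

# Assumed array sizes for (D1, D2, D3, D4); illustration only.
dim_sizes = (8, 8, 4, 4)

# The drawing: one RF_O serves D2, D3 and D4.
drawing = {(0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)}
# The hierarchy definition as described above: one RF_O serves only D2.
actual = {(0, 1, 0, 0)}

print(count_instances(dim_sizes, drawing))  # 8   (= D1)
print(count_instances(dim_sizes, actual))   # 128 (= D1 * D3 * D4)
```

With these assumed sizes, the drawing implies 8 output registers, while serving only D2 gives 128.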

Let me know if you have additional questions!

Kind regards, Arne

chensy7 commented 8 months ago

Arne,

Thanks for getting back to me. I think I understand now. I somehow missed that RF_2B is only serving D2, so there are a lot more instances than I initially thought. Would an updated visualization look something like this?

image

Best, Siyuan

asyms commented 8 months ago

Yes, your new visualization is more accurate. Each RF_2B (O) serves only D2 and is replicated across dimensions D1, D3 and D4, so in total there will be D1 × D3 × D4 instances in the entire hierarchy. Also keep in mind that in a real architecture you would have some form of reduction (adder tree) from those MACs to the RF for output, which is abstracted away in ZigZag. The write bandwidth of the RF determines the number of outputs that can be reduced in a single clock cycle and limits the spatial unrolling onto the physical dimensions accordingly.
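To make the reduction concrete, here is a minimal behavioural sketch of one output register fed by an adder tree. It assumes the MACs sharing the register are unrolled over input channels, as suggested in the original question. The names `adder_tree` and `accumulate_output` are hypothetical. This is not how ZigZag models the hardware, since ZigZag abstracts the reduction away.

```python
def adder_tree(values):
    """Reduce a list of partial products pairwise, as a binary adder tree would."""
    vals = list(values)
    if not vals:
        return 0
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)  # pad odd levels with a zero input
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0]

def accumulate_output(inputs, weights, lanes):
    """Accumulate one output value in a single register.

    inputs, weights: per-input-channel values contributing to one output.
    lanes: number of MACs that share the register (assumed spatial unrolling).
    """
    rf_o = 0  # the single-entry output register
    for start in range(0, len(inputs), lanes):  # temporal loop over channel tiles
        products = [a * w for a, w in zip(inputs[start:start + lanes],
                                          weights[start:start + lanes])]
        rf_o += adder_tree(products)  # one register update per step
    return rf_o

print(accumulate_output([1, 2, 3, 4, 5, 6, 7, 8], [2] * 8, lanes=4))  # 72
```

The register holds only one value, but it is reused across every temporal step that contributes to the same output. That is the reuse a one-entry output RF can provide.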