Closed chensy7 closed 8 months ago
Hi, thanks for your questions and apologies for the delay in response.
Let me know if you have additional questions!
Kind regards, Arne
Arne,
Thanks for getting back to me. I think I understand now. I somehow missed that RF_2B is only serving D2 so there are a lot more instances than I initially thought. An updated visualization would maybe look something like this?
Best, Siyuan
Yes, your new visualization is more accurate. RF_2B (O) is replicated across dimensions D1, D3 and D4, so in total there will be D1 × D3 × D4 instances in the entire hierarchy. Also keep in mind that a real architecture would have some form of reduction (an adder tree) between those MACs and the output RF, which ZigZag abstracts away. The write bandwidth of the RF determines the number of outputs that can be reduced in a single clock cycle, and limits the spatial unrolling onto the physical dimensions accordingly.
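To make the instance count concrete, here is a minimal Python sketch. The dimension sizes (D1=8, D2=8, D3=4, D4=4) are assumptions chosen for illustration and should be checked against the actual Edge TPU hardware description. The helper names `count_instances` and `units_per_instance` are hypothetical and not part of the ZigZag API.

```python
from math import prod

# Assumed sizes of the physical array dimensions (illustrative only).
dim_sizes = {"D1": 8, "D2": 8, "D3": 4, "D4": 4}


def count_instances(dim_sizes, served_dims):
    """Number of copies of a memory that is shared along `served_dims`.

    The memory is replicated along every dimension it does NOT serve.
    """
    return prod(size for name, size in dim_sizes.items() if name not in served_dims)


def units_per_instance(dim_sizes, served_dims):
    """Number of MAC units that share one copy of the memory."""
    return prod(size for name, size in dim_sizes.items() if name in served_dims)


# RF_2B (O) serves only D2, so it is replicated across D1, D3 and D4.
rf_2b_instances = count_instances(dim_sizes, {"D2"})
macs_per_rf_2b = units_per_instance(dim_sizes, {"D2"})

print(rf_2b_instances)  # D1 * D3 * D4
print(macs_per_rf_2b)   # D2
```

Under these assumed sizes, each RF_2B instance collects the partial sums of the D2 MACs attached to it, which is where the abstracted adder tree would sit in a real design.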
Hello Zigzag team,
I'm going through the Edge TPU example and trying to understand the details of the hardware-arch/mappings in Zigzag. I have a couple of questions on that.
I'm using this mapping found by Zigzag for LayerNode 13 of ResNet-18: Edge_TPU-resnet18-layer_LayerNode_13_complete.json. From it, I created the following nested loops based on my understanding of the loop ordering/blocking found by Zigzag. The _s loops correspond to the spatial mapping and would be unrolled in an HLS setting. For each memory level, I also added a comment next to the highest-level loop it serves.
Based on the hardware architecture description, this is what I think it should look like:
(1). Is this understanding of the Zigzag outputs correct?

(2). In the .json file, the temporal mapping for each operand has N values (each value then contains some loops and their sizes). N seems to correspond to the number of memory hierarchy levels for that operand. Is this correct? If so, why does the spatial mapping have N+1 values?

(3). Based on what's generated, I understand that the RF_1B for weights is essentially a register holding the single weight that is spatially reused by each 4x4 MAC array unit. However, I don't understand the reuse pattern of the RF_2B for outputs. It seems to me that C (input channel) is unrolled so that outputs can be accumulated across input channels and stored in RF_2B, but RF_2B is only large enough to store 1 output, so it has to be constantly updated. Am I missing something?

(4). In the Edge TPU example, lines 29-41, the RF_2B is defined with size 16, two read ports and two write ports. Could you explain what this is trying to describe? It seems to me that this is just a 16b register (in FFs). Why would it have 2 ports each for read and write?
Sorry for the long post and thanks in advance for your time!