Closed suyashbakshi closed 2 years ago
You can create a "virtual" (or "dummy") level that bypasses all the tensors and has size=0 to model such a mapping.
Could you show an example where this is important?
Thank you. Are you suggesting that K=2 would be mapped to the dummy buffer? If so, when several other loops (not just 'K') also have multiple factors, would a new dummy buffer need to be added for each factor?
I do not have a good example, but I have observed that mappings with only one temporal factor in a memory level can differ (for better or worse) in energy consumption or performance from mappings with multiple temporal factors. I have been using the ZigZag framework for mapping search, and it allows multiple temporal factors of the same loop within one memory level. The example shown in Fig. 6 of their paper has two C=2 factors assigned to the PE register file.
As I understand it, such mappings have different tile access patterns, leading to different energy consumption, stalls, utilization, etc., compared to mappings with only one factor assigned to a memory level.
Well, in your example I was thinking the virtual level would be a child of the L1, so K=3 (the inner loop) would be assigned to it while the other loops would be assigned to the L1. You do not need an additional virtual level for each dimension: a single virtual buffer gives you a full new block of loops, with one loop for each dimension. If for some reason you need 3 loops for the same dimension, then you'll have to instantiate another virtual level.
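The split described above can be pictured as a plain loop nest. This is only an illustrative sketch, not Timeloop code: it assumes the K=6 example from the question, with the outer factor K=2 kept at L1 and the inner factor K=3 assigned to the virtual level, and the function name `split_k_loop` is hypothetical.

```python
def split_k_loop(outer=2, inner=3):
    """Enumerate the K indices visited when the K loop is split in two.

    `outer` stands for the temporal factor kept at L1 and `inner` for the
    factor assigned to the virtual (zero-size, all-bypass) level below it.
    Both names are illustrative assumptions.
    """
    visited = []
    for k1 in range(outer):          # temporal loop at L1
        for k0 in range(inner):      # temporal loop at the virtual level
            visited.append(k1 * inner + k0)
    return visited


print(split_k_loop())  # [0, 1, 2, 3, 4, 5]
```

Because the virtual level stores nothing, it contributes only the extra block of loops; the two factors together still cover the full K range of 6.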
I would encourage you to come up with exactly 1 counter-example where >1 temporal factor would be beneficial, and the benefit cannot be captured by an equivalent 1-factor mapping. Use the simplest architecture you can to trigger the phenomenon (e.g., 1 DRAM + 1 buffer + 1 arithmetic would be great if you can demonstrate it there). Similarly, use the simplest possible problem shape -- can you show it with a vector-scalar multiply? If not, then maybe a vector dot-product? A 1D-convolution? Etc. The simplest and smallest possible example to show the phenomenon. That is the best tool to help convince not just yourself but others as well.
Thank you. I guess I'm making a mistake when implementing the dummy buffer, since timeloop-model complains about it. Here's the architecture with a dummy buffer added as a child of GlobalBuffer.
architecture:
  version: 0.2
  subtree:
  - name: System
    local:
    - name: MainMemory
      class: DRAM
      attributes:
        sizeKB: 102400
        word-bits: 8
    subtree:
    - name: Chip
      attributes:
        technology: 40nm
      local:
      - name: GlobalBuffer
        class: SRAM
        attributes:
          depth: 1728
          width: 256
          block-size: 32
          word-bits: 8
      - name: DummyBuffer
        class: SRAM
        type: bypass
        attributes:
          sizeKB: 0
          word-bits: 8
        bypass: [Inputs, Weights, Outputs]
      subtree:
      - name: PE
        local:
        - name: RegisterFile[0..255]
          class: regfile
          attributes:
            depth: 672
            width: 8
            block-size: 1
            word-bits: 8
        - name: MACC[0..255]
          class: intmac
          attributes:
            datawidth: 8
And an example mapping. Even with no temporal factors assigned to the dummy buffer, the error below complains that the dummy buffer does not have enough capacity.
mapping:
  - target: MainMemory
    type: temporal
    factors: N=1 C=2 K=8 R=5 S=5 Q=1 P=1
    permutation: NCKRSQP
  - target: GlobalBuffer
    type: temporal
    factors: N=1 C=1 K=1 R=1 S=1 Q=1 P=1
    permutation: NCKRSQP
  - target: DummyBuffer
    type: temporal
    factors: N=1 C=1 K=1 R=1 S=1 Q=1 P=1
    permutation: NCKRSQP
  - target: DummyBuffer
    type: spatial
    factors: N=1 C=12 K=1 R=1 S=1 Q=13 P=1
    permutation: NCKRSQP
  - target: RegisterFile
    type: temporal
    factors: N=1 C=8 K=24 R=1 S=1 Q=1 P=13
    permutation: CNKRSQP
Problem dimension:
problem:
  shape:
    name: "CNN-Layer"
    dimensions: [ R, S, P, Q, C, K, N ]
    coefficients:
    - name: Wstride
      default: 1
    - name: Hstride
      default: 1
    - name: Wdilation
      default: 1
    - name: Hdilation
      default: 1
    data-spaces:
    - name: Weights
      projection:
      - [ [C] ]
      - [ [K] ]
      - [ [R] ]
      - [ [S] ]
    - name: Inputs
      projection:
      - [ [N] ]
      - [ [C] ]
      - [ [R, Wdilation], [P, Wstride] ] # SOP form: R*Wdilation + P*Wstride
      - [ [S, Hdilation], [Q, Hstride] ] # SOP form: S*Hdilation + Q*Hstride
    - name: Outputs
      projection:
      - [ [N] ]
      - [ [K] ]
      - [ [Q] ]
      - [ [P] ]
      read-write: True
  instance:
    C: 192
    K: 192
    R: 5
    S: 5
    Q: 13
    P: 13
    N: 1
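As a quick sanity check, the per-dimension factors of the example mapping can be multiplied across all levels and compared with the problem instance. The sketch below is a standalone Python check, not part of Timeloop; the helper names `parse_factors` and `factor_products` are hypothetical, and the factor strings are copied from the mapping above.

```python
from math import prod

# Problem instance, copied from the listing above.
instance = {"C": 192, "K": 192, "R": 5, "S": 5, "Q": 13, "P": 13, "N": 1}

# "factors" lines of the example mapping, outermost level first.
mapping_factors = [
    "N=1 C=2 K=8 R=5 S=5 Q=1 P=1",    # MainMemory, temporal
    "N=1 C=1 K=1 R=1 S=1 Q=1 P=1",    # GlobalBuffer, temporal
    "N=1 C=1 K=1 R=1 S=1 Q=1 P=1",    # DummyBuffer, temporal
    "N=1 C=12 K=1 R=1 S=1 Q=13 P=1",  # DummyBuffer, spatial
    "N=1 C=8 K=24 R=1 S=1 Q=1 P=13",  # RegisterFile, temporal
]


def parse_factors(line):
    """Turn 'N=1 C=2 ...' into {'N': 1, 'C': 2, ...}."""
    return {name: int(value)
            for name, value in (token.split("=") for token in line.split())}


def factor_products(lines):
    """Multiply each dimension's factors across all mapping levels."""
    parsed = [parse_factors(line) for line in lines]
    return {dim: prod(level[dim] for level in parsed) for dim in parsed[0]}


products = factor_products(mapping_factors)
mismatches = {dim: (products[dim], instance[dim])
              for dim in instance if products[dim] != instance[dim]}
print(mismatches)  # {} : every dimension is fully covered
```

An empty result means the factors of every dimension multiply out to the instance bounds, so the capacity error is not caused by a mis-factored loop.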
And the output of timeloop-model is:
execute:/usr/local/bin/accelergy arch.yaml map.yaml prob.yaml --oprefix timeloop-model. -o ./ > timeloop-model.accelergy.log 2>&1
ERROR: couldn't map level DummyBuffer: mapped tile size 22584 exceeds buffer capacity 0
For your other comment, I have not yet been able to come up with a simple example using 1D conv to show the benefits of having >1 temporal factors. I will try to post one soon.
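The tile size in the error message above can be reproduced by hand. The sketch below is a standalone Python calculation, not Timeloop's implementation. It assumes that the capacity check sums the Weights, Inputs and Outputs tiles formed by every loop at and below DummyBuffer (including its spatial fan-out), with all strides and dilations equal to 1. The helper names are hypothetical.

```python
from math import prod

DIMS = "NCKRSQP"

# Loop factors at and below DummyBuffer, copied from the mapping above.
levels_at_and_below_dummy = [
    {"N": 1, "C": 1, "K": 1, "R": 1, "S": 1, "Q": 1, "P": 1},    # DummyBuffer, temporal
    {"N": 1, "C": 12, "K": 1, "R": 1, "S": 1, "Q": 13, "P": 1},  # DummyBuffer, spatial
    {"N": 1, "C": 8, "K": 24, "R": 1, "S": 1, "Q": 1, "P": 13},  # RegisterFile, temporal
]


def tile_extents(levels):
    """Extent of each dimension covered by the given block of loops."""
    return {d: prod(level[d] for level in levels) for d in DIMS}


def data_space_sizes(t, stride=1, dilation=1):
    """Tile size in words for each data space of the CNN-Layer shape."""
    # The input footprint follows the projection R*dilation + P*stride.
    in_w = (t["R"] - 1) * dilation + (t["P"] - 1) * stride + 1
    in_h = (t["S"] - 1) * dilation + (t["Q"] - 1) * stride + 1
    return {
        "Weights": t["C"] * t["K"] * t["R"] * t["S"],
        "Inputs": t["N"] * t["C"] * in_w * in_h,
        "Outputs": t["N"] * t["K"] * t["Q"] * t["P"],
    }


extents = tile_extents(levels_at_and_below_dummy)
sizes = data_space_sizes(extents)
total = sum(sizes.values())
print(sizes)  # {'Weights': 2304, 'Inputs': 16224, 'Outputs': 4056}
print(total)  # 22584
```

The total matches the 22584 reported by timeloop-model, which is consistent with all three data spaces being kept at DummyBuffer rather than bypassed.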
Bypassing is a mapping directive.
  - target: DummyBuffer
    type: bypass
    bypass: [Inputs, Weights, Outputs]
Thank you for pointing that out. I'm now getting energy consumption results, but I want to verify that I'm defining the mapping correctly. I'd appreciate it if you could let me know:
Architecture:
architecture:
  version: 0.2
  subtree:
  - name: System
    local:
    - name: MainMemory
      class: DRAM
      attributes:
        sizeKB: 102400
        word-bits: 8
    subtree:
    - name: Chip
      attributes:
        technology: 40nm
      local:
      - name: GlobalBuffer
        class: SRAM
        attributes:
          depth: 1728
          width: 256
          block-size: 32
          word-bits: 8
      - name: DummyBuffer
        class: SRAM
        attributes:
          sizeKB: 0
      subtree:
      - name: PE
        local:
        - name: RegisterFile[0..255]
          class: regfile
          attributes:
            depth: 672
            width: 8
            block-size: 1
            word-bits: 8
        - name: MACC[0..255]
          class: intmac
          attributes:
            datawidth: 8
Mapping:
mapping:
  - target: MainMemory
    type: temporal
    factors: N=1 C=128 K=4 R=5 S=5 Q=1 P=1
    permutation: NCKRSQP
  - target: GlobalBuffer
    type: temporal
    factors: N=1 C=1 K=1 R=1 S=1 Q=1 P=1
    permutation: NCKRSQP
  - target: DummyBuffer
    type: bypass
    bypass: [Inputs, Weights, Outputs]
  - target: DummyBuffer
    type: temporal
    factors: N=1 C=2 K=2 R=1 S=1 Q=1 P=1
    permutation: NKCRSQP
  - target: DummyBuffer
    type: spatial
    factors: N=1 C=12 K=1 R=1 S=1 Q=13 P=1
    permutation: NCKRSQP
  - target: RegisterFile
    type: temporal
    factors: N=1 C=4 K=24 R=1 S=1 Q=1 P=13
    permutation: CNKRSQP
On a side note, how can I check the energy consumption of individual components (both memory and arithmetic)? I know timeloop-model dumps data about the components used in the architecture, but is there a way to get information about components not used in the architecture?
I'm not sure what you mean by "components not used in the architecture". For example, you can instantiate SRAMs of various widths, heights, banks, etc. Are you looking for a dump of every possible configuration? I'm not sure how that would be useful, let alone doable, especially for the more sophisticated Accelergy plugins (e.g., Aladdin, Cacti).
No, sorry. I was looking for a sort of "catalog" of different architecture components, with their associated access energy cost, bandwidth, etc., from which one can pick and choose components for implementing a hardware architecture. Thanks to the tutorial, I found them. The ERTs are generated by Accelergy via its estimation plugins. That answers my questions for now. Thanks for all the help.
Hello, is there a way to define a mapping with multiple factors of a loop assigned to the same temporal level?
For example, say loop 'K' has range = 6, with the two factors '3' and '2'. At some temporal level (say L1) in the hardware architecture, I want to define a mapping as:
in which case the mapping would look like: