Multiple factors of a loop at same temporal level

suyashbakshi commented 2 years ago

Hello, is there a way to define a mapping with multiple factors of a loop assigned to the same temporal level?

For example, say loop 'K' has range = 6, for which the two factors are '3' and '2'. And at some temporal level (say L1) in the hardware architecture, I want to define a mapping as:

== L1 memory ==
for k = 1:2
  for <some other temporal loop> (say 'C') = 1:<some loop range>
    for k = 1:3

in which case the mapping would look like:

- target: L1_memory
  type: temporal
  factors: K=2 K=3 C=<some loop range>
  permutation: KCK
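
To make the intended iteration order concrete, here is a minimal Python sketch of the loop nest described above. It assumes K is split as 2 x 3 and picks an arbitrary range of 4 for C, since the post leaves that range unspecified. Indices are 0-based here, and the names `K_OUTER`, `K_INNER`, `C_RANGE`, and `split_k_order` are illustrative only, not part of Timeloop.

```python
from itertools import product

K_OUTER, K_INNER = 2, 3   # the two factors of K = 6
C_RANGE = 4               # assumed value; the post leaves this open


def split_k_order():
    """Yield (k, c) in the order visited by: for k1 / for c / for k0."""
    for k1, c, k0 in product(range(K_OUTER), range(C_RANGE), range(K_INNER)):
        # Recombine the outer and inner K indices into the full K index.
        yield k1 * K_INNER + k0, c


visited = list(split_k_order())
print(visited[:6])  # [(0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1)]
```

Every (k, c) pair is still visited exactly once; only the order changes compared with a single K loop of range 6.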

angshuman-parashar commented 2 years ago

You can create a "virtual" (or "dummy") level that bypasses all the tensors and has size=0 to model such a mapping.

Could you show an example where this is important?

suyashbakshi commented 2 years ago

Thank you. Are you suggesting that K=2 would be mapped to the dummy buffer? If so, when there are factors of several other loops (not just 'K'), would a new dummy buffer need to be added for each factor?

I do not have a good example, but I have observed that mappings with only one temporal factor in a memory level can have different (better or worse) energy consumption or performance than mappings with multiple temporal factors. I have been using the ZigZag framework for mapping search, and it allows multiple temporal factors of the same loop within one memory level. The example in Fig. 6 of their paper has two C=2 factors assigned to the PE register file.

As I understand it, such mappings will have different tile access patterns, leading to different energy consumption, stalls, utilization, etc. compared to mappings with only one factor assigned to a memory level.

angshuman-parashar commented 2 years ago

Well in your example I was thinking the virtual level would be a child of the L1, and so K=3 (the inner loop) would be assigned to it while the other loops would be assigned to the L1. You do not need an additional virtual level for each dimension -- a single virtual buffer gives you a full new block of loops with 1 loop for each dimension. If for some reason you need 3 loops for the same dimension, then you'll have to instantiate another virtual level.
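
The idea above can be sketched in Python as a toy model, not Timeloop's actual data structures. Each level holds at most one loop per dimension, and a hypothetical zero-size `VirtualLevel` under the L1 supplies the second K loop. The level names, the C bound of 4, and the helper functions are all assumptions made for illustration.

```python
from math import prod

# Levels listed outermost first; loops within a level are outermost first.
LEVELS = [
    ("L1_memory", [("K", 2), ("C", 4)]),   # C=4 is an arbitrary bound
    ("VirtualLevel", [("K", 3)]),          # hypothetical child of the L1
]


def check_one_loop_per_dim(levels):
    """Raise ValueError if any single level repeats a dimension."""
    for name, loops in levels:
        dims = [d for d, _ in loops]
        if len(dims) != len(set(dims)):
            raise ValueError(f"{name} repeats a dimension: {dims}")


def total_factor(levels, dim):
    """Product of all factors of `dim` across every level."""
    return prod(b for _, loops in levels for d, b in loops if d == dim)


def render(levels):
    """Return the flattened loop nest as indented text lines."""
    lines, depth = [], 0
    for name, loops in levels:
        lines.append(f"== {name} ==")
        for dim, bound in loops:
            lines.append("  " * depth + f"for {dim.lower()} in range({bound}):")
            depth += 1
    return lines


check_one_loop_per_dim(LEVELS)
assert total_factor(LEVELS, "K") == 6
print("\n".join(render(LEVELS)))
```

A third K loop would require one more virtual level in `LEVELS`, which mirrors the point about instantiating another level.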

I would encourage you to come up with exactly 1 counter-example where >1 temporal factor would be beneficial, and the benefit cannot be captured by an equivalent 1-factor mapping. Use the simplest architecture you can to trigger the phenomenon (e.g., 1 DRAM + 1 buffer + 1 arithmetic would be great if you can demonstrate it there). Similarly, use the simplest possible problem shape -- can you show it with a vector-scalar multiply? If not, then maybe a vector dot-product? A 1D-convolution? Etc. The simplest and smallest possible example to show the phenomenon. That is the best tool to help convince not just yourself but others as well.

suyashbakshi commented 2 years ago

Thank you. I guess I'm making a mistake in implementing the dummy buffer, since timeloop-model complains about it. Here's the architecture with a dummy buffer added as a child of GlobalBuffer.

architecture:
  version: 0.2

  subtree:
  - name: System

    local:
    - name: MainMemory
      class: DRAM
      attributes:
        sizeKB: 102400
        word-bits: 8

    subtree:
    - name: Chip
      attributes:
        technology: 40nm

      local:
      - name: GlobalBuffer
        class: SRAM
        attributes:
          depth: 1728
          width: 256
          block-size: 32
          word-bits: 8

      - name: DummyBuffer
        class: SRAM
        type: bypass
        attributes:
          sizeKB: 0
          word-bits: 8
          bypass: [Inputs, Weights, Outputs]

      subtree:
      - name: PE

        local:
        - name: RegisterFile[0..255]
          class: regfile
          attributes:
            depth: 672
            width: 8
            block-size: 1
            word-bits: 8
        - name: MACC[0..255]
          class: intmac
          attributes:
            datawidth: 8

And here is an example mapping. Even with no temporal factors assigned to the dummy buffer, the error below complains that the dummy buffer does not have enough memory.

mapping:
  - target: MainMemory
    type: temporal
    factors: N=1 C=2 K=8 R=5 S=5 Q=1 P=1
    permutation: NCKRSQP

  - target: GlobalBuffer
    type: temporal
    factors: N=1 C=1 K=1 R=1 S=1 Q=1 P=1
    permutation: NCKRSQP

  - target: DummyBuffer
    type: temporal
    factors: N=1 C=1 K=1 R=1 S=1 Q=1 P=1
    permutation: NCKRSQP

  - target: DummyBuffer
    type: spatial
    factors: N=1 C=12 K=1 R=1 S=1 Q=13 P=1
    permutation: NCKRSQP

  - target: RegisterFile
    type: temporal
    factors: N=1 C=8 K=24 R=1 S=1 Q=1 P=13
    permutation: CNKRSQP

Problem dimension:

problem:
  shape:
    name: "CNN-Layer"
    dimensions: [ R, S, P, Q, C, K, N ]
    coefficients:
      - name: Wstride
        default: 1
      - name: Hstride
        default: 1
      - name: Wdilation
        default: 1
      - name: Hdilation
        default: 1

    data-spaces:
      - name: Weights
        projection:
          - [ [C] ]
          - [ [K] ]
          - [ [R] ]
          - [ [S] ]
      - name: Inputs
        projection:
          - [ [N] ]
          - [ [C] ]
          - [ [R, Wdilation], [P, Wstride] ] # SOP form: R*Wdilation + P*Wstride
          - [ [S, Hdilation], [Q, Hstride] ] # SOP form: S*Hdilation + Q*Hstride 
      - name: Outputs
        projection:
          - [ [N] ]
          - [ [K] ]
          - [ [Q] ]
          - [ [P] ]
        read-write: True

  instance:
    C: 192
    K: 192
    R: 5
    S: 5
    Q: 13
    P: 13
    N: 1

And the output of timeloop-model is:

execute:/usr/local/bin/accelergy arch.yaml map.yaml prob.yaml --oprefix timeloop-model. -o ./ > timeloop-model.accelergy.log 2>&1
ERROR: couldn't map level DummyBuffer: mapped tile size 22584 exceeds buffer capacity 0

For your other comment, I have not yet been able to come up with a simple example using 1D conv to show the benefits of having >1 temporal factors. I will try to post one soon.
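
For readers wondering where the 22584 in the error comes from, the following Python sketch estimates the tile footprint at DummyBuffer from the mapping and problem shape posted above. It is a simplified model, not Timeloop's implementation: it multiplies the loop factors at and below DummyBuffer and applies the posted projections with stride and dilation of 1. The names `BLOCKS`, `tile_extents`, and `footprint` are illustrative.

```python
from math import prod

DIMS = "NCKRSQP"

# Loop blocks at and below DummyBuffer in the posted mapping
# (the DummyBuffer temporal block has all factors equal to 1, so it is omitted).
BLOCKS = [
    {"C": 12, "Q": 13},          # DummyBuffer spatial
    {"C": 8, "K": 24, "P": 13},  # RegisterFile temporal
]


def tile_extents(blocks):
    """Per-dimension tile extent: product of that dimension's factors."""
    return {d: prod(b.get(d, 1) for b in blocks) for d in DIMS}


def footprint(t, stride=1, dilation=1):
    """Words touched per data space, following the posted projections."""
    in_w = (t["R"] - 1) * dilation + (t["P"] - 1) * stride + 1
    in_h = (t["S"] - 1) * dilation + (t["Q"] - 1) * stride + 1
    return {
        "Weights": t["C"] * t["K"] * t["R"] * t["S"],
        "Inputs": t["N"] * t["C"] * in_w * in_h,
        "Outputs": t["N"] * t["K"] * t["Q"] * t["P"],
    }


words = footprint(tile_extents(BLOCKS))
print(words, sum(words.values()))
# {'Weights': 2304, 'Inputs': 16224, 'Outputs': 4056} 22584
```

Under this model the three tiles add up to 22584 words, the same figure as in the error message, which suggests DummyBuffer is being asked to hold all three tensors.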

angshuman-parashar commented 2 years ago

Bypassing is a mapping directive.

- target: DummyBuffer
  type: bypass
  bypass: [Inputs, Weights, Outputs]
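
A small Python sketch can illustrate why the bypass directive matters for a zero-size level. It is a toy capacity check, not Timeloop's logic. The per-tensor word counts are assumptions derived from the mapping posted earlier in the thread, chosen so that they sum to the 22584 reported in the error. `required_capacity` and `fits` are hypothetical helper names.

```python
# Assumed per-data-space tile sizes (in words) at DummyBuffer.
TILE_WORDS = {"Weights": 2304, "Inputs": 16224, "Outputs": 4056}


def required_capacity(tile_words, bypass):
    """Words the level must hold: only data spaces that are not bypassed."""
    return sum(w for name, w in tile_words.items() if name not in bypass)


def fits(tile_words, bypass, capacity):
    """True if the kept tiles fit in the level's capacity."""
    return required_capacity(tile_words, bypass) <= capacity


# Without a bypass directive, everything is kept and a 0-word buffer overflows.
print(required_capacity(TILE_WORDS, bypass=set()))           # 22584
print(fits(TILE_WORDS, bypass=set(), capacity=0))            # False

# Bypassing all three data spaces leaves nothing to store.
all_spaces = {"Inputs", "Weights", "Outputs"}
print(required_capacity(TILE_WORDS, bypass=all_spaces))      # 0
print(fits(TILE_WORDS, bypass=all_spaces, capacity=0))       # True
```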

suyashbakshi commented 2 years ago

Thank you for pointing that out. I'm now getting energy consumption results, but I want to verify that I'm defining the mapping correctly. I'd appreciate it if you could let me know whether the following looks right:

Architecture:

architecture:
  version: 0.2

  subtree:
  - name: System

    local:
    - name: MainMemory
      class: DRAM
      attributes:
        sizeKB: 102400
        word-bits: 8

    subtree:
    - name: Chip
      attributes:
        technology: 40nm

      local:
      - name: GlobalBuffer
        class: SRAM
        attributes:
          depth: 1728
          width: 256
          block-size: 32
          word-bits: 8

      - name: DummyBuffer
        class: SRAM
        attributes:
          sizeKB: 0

      subtree:
      - name: PE

        local:
        - name: RegisterFile[0..255]
          class: regfile
          attributes:
            depth: 672
            width: 8
            block-size: 1
            word-bits: 8
        - name: MACC[0..255]
          class: intmac
          attributes:
            datawidth: 8

Mapping:


mapping:
  - target: MainMemory
    type: temporal
    factors: N=1 C=128 K=4 R=5 S=5 Q=1 P=1
    permutation: NCKRSQP

  - target: GlobalBuffer
    type: temporal
    factors: N=1 C=1 K=1 R=1 S=1 Q=1 P=1
    permutation: NCKRSQP

  - target: DummyBuffer
    type: bypass
    bypass: [Inputs, Weights, Outputs]

  - target: DummyBuffer
    type: temporal
    factors: N=1 C=2 K=2 R=1 S=1 Q=1 P=1
    permutation: NKCRSQP

  - target: DummyBuffer
    type: spatial
    factors: N=1 C=12 K=1 R=1 S=1 Q=13 P=1
    permutation: NCKRSQP

  - target: RegisterFile
    type: temporal
    factors: N=1 C=4 K=24 R=1 S=1 Q=1 P=13
    permutation: CNKRSQP
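
One quick sanity check for a mapping like the one above is to confirm that, for each dimension, the factors across all levels multiply to the problem size. The Python sketch below does this for the posted factor lines, assuming the problem instance from the earlier post (C=192, K=192, R=5, S=5, Q=13, P=13, N=1) still applies. `parse_factors` and `mismatches` are illustrative helpers, not Timeloop functions.

```python
from math import prod

# Assumed to be unchanged from the problem instance posted earlier.
INSTANCE = {"C": 192, "K": 192, "R": 5, "S": 5, "Q": 13, "P": 13, "N": 1}

# The `factors:` lines of the mapping above, outermost level first.
MAPPING = [
    "N=1 C=128 K=4 R=5 S=5 Q=1 P=1",   # MainMemory temporal
    "N=1 C=1 K=1 R=1 S=1 Q=1 P=1",     # GlobalBuffer temporal
    "N=1 C=2 K=2 R=1 S=1 Q=1 P=1",     # DummyBuffer temporal
    "N=1 C=12 K=1 R=1 S=1 Q=13 P=1",   # DummyBuffer spatial
    "N=1 C=4 K=24 R=1 S=1 Q=1 P=13",   # RegisterFile temporal
]


def parse_factors(text):
    """Turn 'N=1 C=2 ...' into {'N': 1, 'C': 2, ...}."""
    return {k: int(v) for k, v in (tok.split("=") for tok in text.split())}


def mismatches(factor_lines, instance):
    """Return {dim: (product_of_factors, problem_size)} for each mismatch."""
    blocks = [parse_factors(line) for line in factor_lines]
    result = {}
    for dim, size in instance.items():
        got = prod(b.get(dim, 1) for b in blocks)
        if got != size:
            result[dim] = (got, size)
    return result


print(mismatches(MAPPING, INSTANCE))  # {'C': (12288, 192)}
```

If the problem instance is indeed unchanged, the C factors multiply to 12288 rather than 192, so the MainMemory C factor may be worth double-checking (C=2 there would give a product of 192). All other dimensions match.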

On a side note, how can I check the energy consumption of individual components (both memory and arithmetic)? I know timeloop-model dumps data about the components used in the architecture, but is there a way to get information about components not used in the architecture?

angshuman-parashar commented 2 years ago

I'm not sure what you mean by "components not used in the architecture". For example, you can instantiate SRAMs of various widths, heights, banks, etc. Are you looking for a dump of every possible configuration? I'm not sure how that would be useful, let alone doable, especially for the more sophisticated Accelergy plugins (e.g., Aladdin, Cacti).

suyashbakshi commented 2 years ago

No, sorry. I was looking for a sort of "catalog" of different architecture components, with their associated access energy cost, bandwidth, etc., from which one can pick and choose components for implementing a hardware architecture. Thanks to the tutorial, I found them. The ERTs are generated by Accelergy via its estimation plugins. That answers my questions for now. Thanks for all the help.

NVlabs / timeloop

Multiple factors of a loop at same temporal level #145