NVlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
https://timeloop.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License

Imperfect Factorization (Ruby) usage and status #227

Closed MustafaFayez closed 8 months ago

MustafaFayez commented 9 months ago

How do I use the imperfect factorization feature in timeloop-model/mapper? Is there an example constraint set for it? Also, is it available in the master branch now?

angshuman-parashar commented 9 months ago

You should be able to invoke Ruby by setting mapspace:template to ruby instead of the default uber. It should work with v3.x and master, but it hasn't been tested with the latter.

MarkHoreni commented 8 months ago

Here is an example with a very basic architecture that you can use with the mapper:

Given the architecture:

architecture:
  version: 0.3
  subtree:
    - name: System
      attributes:
        datawidth: 16
        word-bits: 16
        technology: 32nm
      local:
        - name: DRAM
          class: DRAM
          attributes:
            type: LPDDR4
      subtree:
        - name: ws
          local:
            - name: GlobalBuffer
              class: storage
              subclass: SRAM
              attributes:
                depth: 1024
                width: 256
                word_bits: 16
                block_size: 1
            - name: WeightRes[0..2]
              class: storage
              subclass: SRAM
              attributes:
                depth: 1024
                width: 16
                word_bits: 16
                block_size: 1
            - name: MAC[0..2]
              class: compute
              subclass: intmac
              attributes:
                datawidth: 8

you can construct these constraints:

mapspace:
  template: ruby
  targets:
    - target: GlobalBuffer
      type: temporal
      permutation: MCRSPQN
    - target: GlobalBuffer
      type: spatial
      split: 10
      factors: M=1 S=1 R=1 P=1 Q=1 N=1
      remainders: 3
      permutation: MCRSPQN
    - target: DRAM
      type: temporal
      permutation: MCRSPQN
    - target: WeightRes
      type: temporal
      permutation: MCRSPQN
      factors: C=1 M=1 S=1 R=1 P=1 Q=1 N=1
    - target: GlobalBuffer
      type: bypass
      keep: []
      bypass: [Weights, Inputs, Outputs]
    - target: DRAM
      type: bypass
      keep: [Weights, Inputs, Outputs]
    - target: WeightRes
      type: bypass
      keep: [Weights, Inputs, Outputs]
      bypass: []

and this problem:

problem:
  instance:
    C: 10
    Hdilation: 1
    Hstride: 1
    M: 10
    N: 1
    P: 1
    Q: 1
    R: 1
    S: 1
    Wdilation: 1
    Wstride: 1
    densities:
      Inputs:
        density: 1
        distribution: fixed
      Outputs:
        density: 1
        distribution: fixed
      Weights:
        density: 1
        distribution: fixed
  shape:
    coefficients:
    - default: 1
      name: Wstride
    - default: 1
      name: Hstride
    - default: 1
      name: Wdilation
    - default: 1
      name: Hdilation
    data-spaces:
    - name: Weights
      projection:
      - - - C
      - - - M
      - - - R
      - - - S
    - name: Inputs
      projection:
      - - - N
      - - - C
      - - - R
          - Wdilation
        - - P
          - Wstride
      - - - S
          - Hdilation
        - - Q
          - Hstride
    - name: Outputs
      projection:
      - - - N
      - - - M
      - - - Q
      - - - P
      read-write: true
    dimensions:
    - C
    - M
    - R
    - S
    - N
    - P
    - Q
    name: CNN-Layer
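As a rough sanity check on the problem above, the data-space sizes and total multiply-accumulate count can be worked out by hand from the instance values and projections. The sketch below is plain Python, not part of Timeloop; the function name `tensor_sizes` and the dictionary layout are hypothetical, and the input-extent formula is the usual convolution one implied by the stride and dilation coefficients.

```python
# Instance values copied from the problem spec above.
instance = {
    "C": 10, "M": 10, "N": 1, "P": 1, "Q": 1, "R": 1, "S": 1,
    "Wstride": 1, "Hstride": 1, "Wdilation": 1, "Hdilation": 1,
}

def tensor_sizes(p):
    """Return element counts per data space plus the total MAC count."""
    # Extent of the two sliding-window input axes (R with P, S with Q).
    in_w = (p["P"] - 1) * p["Wstride"] + (p["R"] - 1) * p["Wdilation"] + 1
    in_h = (p["Q"] - 1) * p["Hstride"] + (p["S"] - 1) * p["Hdilation"] + 1
    macs = 1
    for dim in ("C", "M", "R", "S", "N", "P", "Q"):
        macs *= p[dim]
    return {
        "Weights": p["C"] * p["M"] * p["R"] * p["S"],
        "Inputs": p["N"] * p["C"] * in_w * in_h,
        "Outputs": p["N"] * p["M"] * p["Q"] * p["P"],
        "MACs": macs,
    }

sizes = tensor_sizes(instance)
print(sizes)
```

With every dimension except C and M set to 1, the layer reduces to a 10x10 matrix-vector product: 100 weights, 10 inputs, 10 outputs and 100 MACs in total.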

When optimizing for cycles, the mapper should find a dataflow that takes only 40 cycles with ruby, as opposed to 50 cycles with uber.
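The 40-versus-50 gap can be reproduced with back-of-the-envelope arithmetic. The sketch below is not Timeloop code and its function names are hypothetical. It assumes that only C is spread across the three MAC units (the spatial constraint fixes every other factor to 1), that each MAC performs one multiply-accumulate per cycle, and that the cycle count equals the number of temporal iterations.

```python
import math

def perfect_cycles(c, m, num_macs):
    """Cycles when the spatial factor of C must divide C exactly."""
    # Largest divisor of C that fits within the available MAC units.
    spatial = max(f for f in range(1, num_macs + 1) if c % f == 0)
    return (c // spatial) * m

def imperfect_cycles(c, m, num_macs):
    """Cycles when C may be split unevenly, leaving a partial last pass."""
    return math.ceil(c / num_macs) * m

print(perfect_cycles(10, 10, 3))    # 50: C split 2 ways, 5 passes x 10
print(imperfect_cycles(10, 10, 3))  # 40: C split 3 ways, 4 passes x 10
```

Under these assumptions, perfect factorization can use only 2 of the 3 MACs, because 3 does not divide 10. Imperfect factorization uses all 3 and accepts one partially filled pass.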