NVlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
https://timeloop.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License

Output tiles #123

Open nivi1501 opened 2 years ago

nivi1501 commented 2 years ago

Hi, I wish to keep track of all the output tiles which are being written to the main memory. Basically, I want to:

  1. Assign a tile number to each output tile.
  2. Increment the tile number whenever the output tile is updated. For example, if 3 input tiles update the output tile, its tile number will be 3.

Is it possible to perform such an analysis using TL? If so, how?

angshuman-parashar commented 1 year ago

What if the same tile is written multiple times (i.e., read-modified-updated)? Do you want to count those as multiple tile writes, or do you only want to count the number of distinct tiles? In either case you just want the total number (and are not trying to generate a trace of labelled tile writes), correct?

nivi1501 commented 1 year ago

I am trying to generate a trace of labeled tile writes and reads (from DRAM to the global buffer). Basically, I am trying to study the DRAM to global buffer traffic at the tile level. I wish to generate traces similar to the following:

    (TileID) (Number of elements in a tile) (Type of access (R/W))
    T1 512 R
    T2 1024 W
    .... ....
    T2 1024 W
    ... ...

I am already familiar with how to estimate tile sizes and the total number of tiles using Timeloop. I just wish to know whether I can generate a similar trace file using TL and, if so, which source file I should focus on to generate it. Any help in this matter will be highly appreciated. Looking forward to your reply.

angshuman-parashar commented 1 year ago

Try the tracing feature. It will emit a trace of the axis-aligned hyper-rectangles that the nest analysis visits at each coordinate in space-time.

You will also have to disable temporal (and maybe spatial) extrapolation. Note that this will massively slow down simulation, because with extrapolation disabled, Timeloop starts behaving more like a cycle-level simulator than a fast analytical model. You should also probably only use this with timeloop-model on a specific mapping. Using tracing with the mapper will just generate a ton of noise that's hard to deal with.

To enable all this, set the following env variables:

TIMELOOP_ENABLE_TRACING=1
TIMELOOP_DISABLE_TEMPORAL_EXTRAPOLATION=1
TIMELOOP_DISABLE_SPATIAL_EXTRAPOLATION=1

and then run timeloop-model as you normally do.
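As a rough sketch of the step above, the same three variables can be set from a small Python wrapper instead of the shell. The function names, the log file name, and the input file names in the trailing comment are hypothetical, and capturing both stdout and stderr reflects an assumption that the trace is printed to one of them.

```python
import os
import subprocess

# The three variables named in the comment above.
TRACE_ENV = {
    "TIMELOOP_ENABLE_TRACING": "1",
    "TIMELOOP_DISABLE_TEMPORAL_EXTRAPOLATION": "1",
    "TIMELOOP_DISABLE_SPATIAL_EXTRAPOLATION": "1",
}


def tracing_env(base=None):
    """Return a copy of the environment with the tracing variables set."""
    env = dict(os.environ if base is None else base)
    env.update(TRACE_ENV)
    return env


def run_with_tracing(input_files, log_path="trace.log"):
    """Run timeloop-model on the given input files and save its output.

    Assumption: the trace is written to stdout or stderr, so both are
    redirected into log_path.
    """
    with open(log_path, "w") as log:
        return subprocess.run(
            ["timeloop-model", *input_files],
            env=tracing_env(),
            stdout=log,
            stderr=subprocess.STDOUT,
            check=True,
        )


# Hypothetical usage; replace the file names with your own inputs:
# run_with_tracing(["arch.yaml", "problem.yaml", "mapping.yaml"])
```

Keeping the variables in a dictionary makes it easy to switch spatial extrapolation back on if the run turns out to be too slow.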

The trace output will look something like this:

    t/7/ s/0/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,0,14:1,2,8,28), } Outputs: { [0,0,14,0:1,256,28,8), } 
      t/8/0/ s/0/0/ Weights: { [0,0,0,0:2,16,1,1), } Inputs: { [0,0,8,14:1,2,16,15), } Outputs: { [0,0,14,8:1,16,15,16), } 
      t/8/1/ s/0/0/ Weights: { [0,128,0,0:2,144,1,1), } Inputs: { [0,0,8,14:1,2,16,15), } Outputs: { [0,128,14,8:1,144,15,16), } 
      t/8/2/ s/0/0/ Weights: { [0,0,0,0:2,16,1,1), } Inputs: { [0,0,8,16:1,2,16,17), } Outputs: { [0,0,16,8:1,16,17,16), } 
      t/8/3/ s/0/0/ Weights: { [0,128,0,0:2,144,1,1), } Inputs: { [0,0,8,16:1,2,16,17), } Outputs: { [0,128,16,8:1,144,17,16), } 
    t/8/ s/0/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,8,14:1,2,16,28), } Outputs: { [0,0,14,8:1,256,28,16), } 
  t/ s/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,0,0:1,2,56,56), } Outputs: { [0,0,0,0:1,256,56,56), } 

Here's how to read the trace:

For more background on hierarchical space/time stamps you can refer to this paper: https://research.nvidia.com/publication/2021-01_hardware-abstractions-targeting-eddo-architectures-polyhedral-model
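A trace like the one above can be post-processed with a short script. The sketch below is not part of Timeloop. It assumes that each bracketed entry is a half-open box written as `[low corner : high corner)`, so that its element count is the product of the per-dimension extents, and that the boxes inside one set do not overlap. All function names are hypothetical.

```python
import re
from math import prod

# Hierarchical time and space stamps at the start of a line, e.g. "t/8/0/ s/0/0/".
STAMP_RE = re.compile(r"^\s*t/([\d/]*)\s+s/([\d/]*)")
# One "Name: { ... }" group, e.g. "Weights: { [0,0,0,0:2,16,1,1), }".
SPACE_RE = re.compile(r"(\w+):\s*\{([^}]*)\}")
# One box, e.g. "[0,0,14,8:1,16,15,16)".
BOX_RE = re.compile(r"\[([\d,]+):([\d,]+)\)")


def parse_stamp(text):
    """Turn '8/0/' into (8, 0); an empty stamp becomes ()."""
    return tuple(int(part) for part in text.split("/") if part)


def box_volume(low, high):
    """Element count of a box, assuming half-open [low, high) bounds."""
    return prod(h - l for l, h in zip(low, high))


def parse_trace_line(line):
    """Return (time_stamp, space_stamp, {data space: elements}) or None."""
    stamp = STAMP_RE.match(line)
    if stamp is None:
        return None
    sizes = {}
    for name, body in SPACE_RE.findall(line):
        total = 0
        for low, high in BOX_RE.findall(body):
            low_pt = [int(x) for x in low.split(",")]
            high_pt = [int(x) for x in high.split(",")]
            total += box_volume(low_pt, high_pt)
        sizes[name] = total  # an empty set "{ }" yields 0
    return parse_stamp(stamp.group(1)), parse_stamp(stamp.group(2)), sizes
```

Whether a count describes a whole tile or only the newly moved data depends on how the trace was generated. The number of stamp components indicates the depth in the hierarchy, so filtering on the stamp length is one way to keep only a single level.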

nivi1501 commented 1 year ago

Thanks a lot for sharing this valuable information. This precise explanation helped me a lot. I tried generating the 'delta' trace and got the following results.

      t/0/191/ s/0/10/ Weights: { [26,31,2:27,32,3), } Inputs: { [26,17:27,18), } Outputs: { }
      t/0/191/ s/0/11/ Weights: { [27,31,2:28,32,3), } Inputs: { [27,17:28,18), } Outputs: { }
      t/0/191/ s/0/12/ Weights: { [28,31,2:29,32,3), } Inputs: { [28,17:29,18), } Outputs: { }
      t/0/191/ s/0/13/ Weights: { [29,31,2:30,32,3), } Inputs: { [29,17:30,18), } Outputs: { }
      t/0/191/ s/0/14/ Weights: { [30,31,2:31,32,3), } Inputs: { [30,17:31,18), } Outputs: { }
      t/0/191/ s/0/15/ Weights: { [31,31,2:32,32,3), } Inputs: { [31,17:32,18), } Outputs: { }
    t/0/ s/0/ Weights: { [0,0,0:32,32,3), } Inputs: { [0,0:32,18), } Outputs: { [0,0:32,16), }
  t/ s/ Weights: { [0,0,0:32,32,3), } Inputs: { [0,0:32,18), } Outputs: { [0,0:32,16), }

Now, I just need to focus on the DRAM to global buffer tile movement (the rest is just noise to me). What I can deduce is that at t/1/ s/0/, an additional 11616 weights and 2497 input elements are read from the DRAM, since, as you mentioned, "Delta trace represents incremental data i.e. moved to construct the tile". However, the output remains stationary in the global buffer. Please let me know if my inferences are correct.

    t/0/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/1/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/2/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/3/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/4/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/5/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/6/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/7/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/8/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/9/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/10/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/11/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/12/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/13/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/14/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/15/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/16/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/17/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/18/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/19/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/20/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/21/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/22/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/23/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/24/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/25/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/26/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/27/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/28/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/29/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/30/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/31/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/32/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/33/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
    t/34/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/35/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
    t/36/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
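
For reference, per-step tallies like these could be turned into the `(TileID) (elements) (R/W)` format requested earlier in the thread with a few lines of Python. This is only a sketch: the function names are hypothetical, tile IDs are simply assigned in order of appearance, and labelling Weights and Inputs as reads and Outputs as writes is an assumption that the thread does not confirm.

```python
import re

# One tally entry, e.g. "t/3/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280".
TALLY_RE = re.compile(
    r"t/(\d+)/\s*s/(\d+)/\s*Weights\s*=\s*(\d+),"
    r"\s*Inputs\s*=\s*(\d+),\s*Outputs\s*=\s*(\d+)"
)


def parse_tallies(text):
    """Return a list of (t, s, weights, inputs, outputs) integer tuples."""
    return [tuple(int(g) for g in m.groups()) for m in TALLY_RE.finditer(text)]


def to_tile_trace(tallies):
    """Emit one 'T<id> <elements> <R|W>' line per non-empty transfer.

    Assumption: Weights and Inputs are reads, Outputs are writes.
    """
    lines = []
    tile_id = 0
    for _t, _s, weights, inputs, outputs in tallies:
        for count, access in ((weights, "R"), (inputs, "R"), (outputs, "W")):
            if count == 0:
                continue  # nothing moved for this data space at this step
            tile_id += 1
            lines.append(f"T{tile_id} {count} {access}")
    return lines


def totals(tallies):
    """Sum the element counts per data space over all steps."""
    return {
        "Weights": sum(row[2] for row in tallies),
        "Inputs": sum(row[3] for row in tallies),
        "Outputs": sum(row[4] for row in tallies),
    }
```

Because the tallies no longer carry coordinates, this numbering cannot tell when the same tile is touched again. Giving repeated accesses to one tile the same ID would require keying on the box coordinates in the raw trace instead.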