Open nivi1501 opened 2 years ago
What if the same tile is written multiple times (i.e., read-modified-updated)? Do you want to count those as multiple tile writes, or do you only want to count the number of distinct tiles? In either case you just want the total number (and are not trying to generate a trace of labelled tile writes), correct?
I am trying to generate a trace of labeled tile writes and reads (from DRAM to Global buffer). Basically, I am trying to study the DRAM to Global buffer traffic at the tile level. I wish to generate traces similar to the following:
(TileID) (Number of elements in a tile) (Type of access (R/W))
T1 512 R
T2 1024 W
.... ....
T2 1024 W
... ...
I am already familiar with how to estimate tile sizes and the total number of tiles using Timeloop. I just wish to know whether I can generate a similar trace file using Timeloop (TL) and, if so, which source file I should focus on to generate it. Any help in this matter will be highly appreciated. Looking forward to your reply.
Try the tracing feature. It will emit a trace of the axis-aligned hyper-rectangles that the nest analysis visits at each coordinate in space-time.
You will also have to disable temporal (and maybe spatial) extrapolation. Note that this will massively slow down simulation. This is because with extrapolation disabled, Timeloop starts behaving more like a cycle-level simulator than a fast analytical model. You should also probably only use this with timeloop-model on a specific mapping. Using tracing with the mapper will just generate a ton of noise that's hard to deal with.
To enable all this, set the following env variables:
TIMELOOP_ENABLE_TRACING=1
TIMELOOP_DISABLE_TEMPORAL_EXTRAPOLATION=1
TIMELOOP_DISABLE_SPATIAL_EXTRAPOLATION=1
and then run timeloop-model as you normally do.
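For anyone who prefers to drive this from a script, the steps above can be sketched in Python as follows. The three variable names are the ones listed in the reply; the helper names `tracing_env` and `run_model_with_tracing` and the `model_args` parameter are hypothetical, and the sketch assumes `timeloop-model` is on your PATH and takes the same input files you would normally pass.

```python
import os
import subprocess

# Variable names as given in the reply above.
TRACING_VARS = {
    "TIMELOOP_ENABLE_TRACING": "1",
    "TIMELOOP_DISABLE_TEMPORAL_EXTRAPOLATION": "1",
    "TIMELOOP_DISABLE_SPATIAL_EXTRAPOLATION": "1",
}


def tracing_env(base=None):
    """Return a copy of the environment with the tracing variables set."""
    env = dict(os.environ if base is None else base)
    env.update(TRACING_VARS)
    return env


def run_model_with_tracing(model_args):
    """Run timeloop-model with tracing enabled.

    model_args is a hypothetical placeholder: pass whatever input files
    you normally give timeloop-model for your specific mapping.
    """
    return subprocess.run(
        ["timeloop-model", *model_args],
        env=tracing_env(),
        check=True,  # raise if timeloop-model exits with an error
    )
```

Building the environment as a copy keeps the variables scoped to this one run, so a later mapper run from the same shell is not accidentally slowed down by disabled extrapolation.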
The trace output will look something like this:
t/7/ s/0/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,0,14:1,2,8,28), } Outputs: { [0,0,14,0:1,256,28,8), }
t/8/0/ s/0/0/ Weights: { [0,0,0,0:2,16,1,1), } Inputs: { [0,0,8,14:1,2,16,15), } Outputs: { [0,0,14,8:1,16,15,16), }
t/8/1/ s/0/0/ Weights: { [0,128,0,0:2,144,1,1), } Inputs: { [0,0,8,14:1,2,16,15), } Outputs: { [0,128,14,8:1,144,15,16), }
t/8/2/ s/0/0/ Weights: { [0,0,0,0:2,16,1,1), } Inputs: { [0,0,8,16:1,2,16,17), } Outputs: { [0,0,16,8:1,16,17,16), }
t/8/3/ s/0/0/ Weights: { [0,128,0,0:2,144,1,1), } Inputs: { [0,0,8,16:1,2,16,17), } Outputs: { [0,128,16,8:1,144,17,16), }
t/8/ s/0/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,8,14:1,2,16,28), } Outputs: { [0,0,14,8:1,256,28,16), }
t/ s/ Weights: { [0,0,0,0:2,256,1,1), } Inputs: { [0,0,0,0:1,2,56,56), } Outputs: { [0,0,0,0:1,256,56,56), }
Here's how to read the trace: t/ and s/ refer to the outermost (e.g., DRAM) level, because the tile never changes there over space or time -- it's the complete tensor. The rank-1 stamps t/8/ and s/0/ refer to the next-inner level (probably the GlobalBuffer), and in this case tell you the tile resident at GlobalBuffer space-coordinate 0 at time-step 8. As you go deeper into the hierarchy, the rank order of the time and space stamps increases.
[...] nest-analysis.cpp to optionally emit the Delta trace as well. This will be a valuable contribution to the tool.
[...] tiling.cpp. By the time Timeloop gets to that stage of processing, all fine-grained information about space and time is discarded, and it's not possible to generate a trace there. So you may have to do some outboard post-processing if you want to incorporate bypassing into the trace.
For more background on hierarchical space/time stamps you can refer to this paper: https://research.nvidia.com/publication/2021-01_hardware-abstractions-targeting-eddo-architectures-polyhedral-model
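As a starting point for the post-processing mentioned above, here is a minimal Python sketch that parses one trace line into its time stamp, space stamp, and per-data-space hyper-rectangles. It assumes the line layout shown in the sample output in this thread and that each rectangle is half-open, `[low:high)`. The function names are hypothetical and not part of Timeloop.

```python
import re
from math import prod

# One "Name: { [lo:hi), [lo:hi), }" group per data space.
_SPACE_RE = re.compile(r"(\w+):\s*\{([^}]*)\}")
# One half-open hyper-rectangle "[lo0,lo1,...:hi0,hi1,...)".
_RECT_RE = re.compile(r"\[([^:\]]*):([^)]*)\)")


def parse_stamp(token):
    """Turn 't/8/1/' into (8, 1) and 't/' into ()."""
    return tuple(int(p) for p in token.split("/")[1:] if p)


def parse_trace_line(line):
    """Return (time_stamp, space_stamp, {data_space: [(lo, hi), ...]})."""
    t_tok, s_tok, rest = line.split(None, 2)
    spaces = {}
    for name, body in _SPACE_RE.findall(rest):
        rects = []
        for lo, hi in _RECT_RE.findall(body):
            lo_pt = tuple(int(x) for x in lo.split(","))
            hi_pt = tuple(int(x) for x in hi.split(","))
            rects.append((lo_pt, hi_pt))
        spaces[name] = rects
    return parse_stamp(t_tok), parse_stamp(s_tok), spaces


def volume(rects):
    """Element count of a list of rectangles, assuming they do not overlap."""
    return sum(prod(h - l for l, h in zip(lo, hi)) for lo, hi in rects)
```

The length of the parsed time stamp gives the rank of the stamp, so keeping only lines where `len(time_stamp) == 1` would select the next-inner level below DRAM, and `volume` then gives the number of elements in each tile at that level.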
Thanks a lot for sharing this valuable information. This precise explanation helped me a lot. I tried generating the 'delta' trace and got the following results.
t/0/191/ s/0/10/ Weights: { [26,31,2:27,32,3), } Inputs: { [26,17:27,18), } Outputs: { }
t/0/191/ s/0/11/ Weights: { [27,31,2:28,32,3), } Inputs: { [27,17:28,18), } Outputs: { }
t/0/191/ s/0/12/ Weights: { [28,31,2:29,32,3), } Inputs: { [28,17:29,18), } Outputs: { }
t/0/191/ s/0/13/ Weights: { [29,31,2:30,32,3), } Inputs: { [29,17:30,18), } Outputs: { }
t/0/191/ s/0/14/ Weights: { [30,31,2:31,32,3), } Inputs: { [30,17:31,18), } Outputs: { }
t/0/191/ s/0/15/ Weights: { [31,31,2:32,32,3), } Inputs: { [31,17:32,18), } Outputs: { }
t/0/ s/0/ Weights: { [0,0,0:32,32,3), } Inputs: { [0,0:32,18), } Outputs: { [0,0:32,16), }
t/ s/ Weights: { [0,0,0:32,32,3), } Inputs: { [0,0:32,18), } Outputs: { [0,0:32,16), }
Now, I just need to focus on the DRAM to global buffer tile movement (the rest is just noise to me). What I can deduce is that at t/1/ s/0/, an additional 11616 weights and 2497 input elements are read from the DRAM, since, as you mentioned, "Delta trace represents incremental data i.e. moved to construct the tile". However, the output remains stationary in the global buffer. Please let me know if my inferences are correct.
t/0/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/1/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/2/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/3/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/4/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/5/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/6/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/7/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/8/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/9/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/10/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/11/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/12/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/13/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/14/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/15/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/16/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/17/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/18/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/19/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/20/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/21/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/22/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/23/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/24/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/25/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/26/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/27/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/28/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/29/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/30/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/31/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/32/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/33/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
t/34/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/35/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 0
t/36/ s/0/ Weights = 11616, Inputs = 2497, Outputs = 5280
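To sanity-check per-step counts like those listed above, one option is to total them per data space with a short Python script. This is only a sketch: it assumes the `t/N/ s/M/ Weights = ..., Inputs = ..., Outputs = ...` layout used in this comment, which is a hand-made summary and not something Timeloop emits. The names `parse_summary` and `totals` are hypothetical.

```python
import re

# Matches one summary row; whitespace (including line breaks) is flexible.
_ROW_RE = re.compile(
    r"t/(\d+)/\s*s/(\d+)/\s*"
    r"Weights\s*=\s*(\d+),\s*Inputs\s*=\s*(\d+),\s*Outputs\s*=\s*(\d+)"
)


def parse_summary(text):
    """Return one dict per summary row found in text."""
    rows = []
    for m in _ROW_RE.finditer(text):
        t, s, w, i, o = (int(g) for g in m.groups())
        rows.append({"t": t, "s": s, "Weights": w, "Inputs": i, "Outputs": o})
    return rows


def totals(rows):
    """Sum the element counts per data space over all rows."""
    return {k: sum(r[k] for r in rows) for k in ("Weights", "Inputs", "Outputs")}
```

Comparing these totals against the per-level access counts that timeloop-model reports for the same mapping would be one way to check whether the inferred per-step deltas are plausible.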
Hi, I wish to keep track of all the output tiles which are being written to the main memory. Basically, I want to: