NVlabs / timeloop

Timeloop performs modeling, mapping and code-generation for tensor algebra workloads on various accelerator architectures.
https://timeloop.csail.mit.edu/
BSD 3-Clause "New" or "Revised" License
319 stars 99 forks source link

scalar reads, updates and fills #157

Closed suyashbakshi closed 1 year ago

suyashbakshi commented 1 year ago

Could you help me have a clear definition of scalar reads, updates and fills please? As I understand,

What about the writes from a level above into current level? Are these accounted from the scalar reads of level above? If so, then from the output of timeloop-model, it does not appear that the energy for writes from level above is added into the total energy for current level.

Also, for what cases could the scalar updates be larger than scalar reads? One case that I can think of is, when a data element is broadcasted to the multiple instances of the level below and if the broadcast is considered as a single read operation. But updates from multiple instances (i.e. a reduction operation) are counted individually.

angshuman-parashar commented 1 year ago

Your understanding of reads and updates are correct but not fills. Think of cache fills, i.e., the number of scalar elements that were brought in to this level from a parent level. If there's 1 parent and 1 child then parent reads == child fills. But if there's multicast then the equation is more complicated. More a more precise definition you can read this paper: https://dl.acm.org/doi/10.1145/3297858.3304025

suyashbakshi commented 1 year ago

Thanks. It appears that the scalar updates are a property of the parent level. That is, number of times data items in parent level are updated by child level. But the child level must be consuming some energy to write data into parent level, which does not seem to accounted for in the cost model?

Also, for the attached timeloop-model output the scalar reads for Output operand at GlobalBuffer and MainMemory levels are reported to be zero. Could you please help explain why that would be the case? timeloop-model.stats.txt

angshuman-parashar commented 1 year ago

Update counts reported at a given level are the number of update scalars it receives from a child. The level in question (i.e., parent/destination) pays the write cost, and the network between the parent and the child pays a data transfer cost. The child's costs are depend on the cadence of data transfer between itself and its child.

For your specific example scalar reads are zero for the Output tensor probably because no partial sums are ever written to it. Only final sums are written, which means nothing is ever read. In general the first read of an Output tensor is always elided by Timeloop because there is no need to read a 0. Thus, the first "update" operation is a true update instead of a read-modify-update.

suyashbakshi commented 1 year ago

Thanks, that makes sense.

suyashbakshi commented 1 year ago

I have question related to the attached timeloop-model output. The output is for a hardware with 256PEs each with a private register file, a shared global buffer and a main memory. If you notice the scalar fills for Outputs at register file level are 87808 per instance, but the scalar reads for outputs at global buffer level are zero. If my understanding is correct, the scalar fills at a given level must be equal (or proportional, depending on the multicast factor etc.) to the scalar reads of its parent level. This same behavior is observed for the pair of global buffer and main memory. My question is, what condition might cause this behavior?

timeloop-model.stats.txt

angshuman-parashar commented 1 year ago

I'm struggling to recall if there was some specific rationale there. Could it have been a 0-initialization cost? Does that make the numbers consistent?

Even if it does, it should be possible to elide that cost, as long as the first "Update" is allowed to overwrite whatever junk that location had previously.

Modeling the first-read-elision has always been a little problematic. There's an entire code path with the flag UpdatedRMW that uses a different (and more accurate) approach to computing this, but that had its own issues.

If we convince ourselves that this is a bug, I would still be hesitant to push out a hotfix immediately. We have a big upcoming upgrade to NestAnalysis in the form of an ISL-based implementation within a few weeks, which should make everything a lot more precise. I'm hoping these artifacts will go away peacefully.

suyashbakshi commented 1 year ago

Yes, if one considers a 0-initialization cost, then the numbers are consistent. The only inconsistency is the cost not being accurately accounted (or elided) for "both" levels.

I believe the first-read-elision works correctly for cases when Outputs are read multiple times, but if they are read only once, then this inconsistency pops up. What would make sense is to probably have an extra step that checks metric consistency between parent and children, but that would obviously add to the complexity and computational cost of evaluating mappings.

Attached are arch, mapping and prob yaml's. arch.txt prob.txt mapping.txt

angshuman-parashar commented 1 year ago

The computation cost of comparing per-level/per-tensor access metrics is very minor relative to the recursive nest/multicast analysis. If you think there's an easy fix to the logic in tiling.cpp to check/compute fills then please give it a shot and let us know.

suyashbakshi commented 1 year ago

Thanks, I will try to fix this. I see that this issue has been there for a while and you suggested to subtract tile[pvi].partition_size. Could you please explain what partition_size represents?

angshuman-parashar commented 1 year ago

Thanks, I will try to fix this. I see that this issue has been there for a while and you suggested to subtract tile[pvi].partition_size. Could you please explain what partition_size represents?

It's the number of distinct points of a tensor that flow through a single instance of the buffer at that level. You can think of it as the union of all tiles that ever flow through that instance.

angshuman-parashar commented 1 year ago

@suyashbakshi could you please check if 59fdc832acfb3f20e58f566dcfd569c207ec673b fixes the issue?

suyashbakshi commented 1 year ago

Hello, yes, it does seem to fix the issue for at least the case I posted above. I will keep an eye out for any other cases. Thanks for taking the time to address this. Much appreciated.

suyashbakshi commented 1 year ago

On second thought, the fills at outermost level are probably broken. Please see the fills for main memory in timeloop-model.stats.txt attached. It is possible that this is happening just for me as I've made several modifications to other files in the source code. Could you double check on your end with the attached arch, prob and mapping file if you get the same stats for Main Memory?

timeloop-model-stats.txt mapping.txt arch.txt prob.txt

angshuman-parashar commented 1 year ago

I'm getting MainMemory fills == 0 when I run those YAMLs with the HEAD version of Timeloop (all other non-energy stats are identical), so I think it's your local changes that are conflicting with this update. Fortunately that number you're seeing appears to be an integer overflow so it should be easy to track down.

suyashbakshi commented 1 year ago

Ok thanks, I will debug my local version.