This is a good, deep question that gets to the root of one of the reasons why Timeloop was created to begin with. I think you almost answered your own question: you cannot evaluate a workload on an architecture without a mapper - unless you're talking about a very silly toy architecture that supports exactly one dataflow. If the architecture supports any flexibility at all (beyond the outermost loop limits) - you need some heuristic to select from one of the available options for a workload. That heuristic could be a programmer thinking "hmm, I think this will work well", or it could be a script (for simple architectures), or it could be a sophisticated mapper (for complex architectures with a lot of flexibility). These are all examples of mappers.
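To make that concrete, here is a minimal, hypothetical sketch (not Timeloop's actual search or cost model) of what the simplest possible mapper does: enumerate the legal tile factors of one loop, score each option with a model, and keep the best one.

```c
/* Minimal "mapper" sketch: exhaustively enumerate the legal tile factors of a
 * single loop and keep the one with the lowest cost under a toy cost model.
 * Real mappers (including Timeloop's) search a far larger space of factors,
 * permutations and spatial splits, but the principle is the same:
 * enumerate legal options, evaluate each with a model, keep the best. */
#include <stdio.h>

/* Toy cost model (NOT Timeloop's): penalize tiles that overflow a
 * hypothetical 32-entry buffer, and reward larger tiles for reuse. */
static double toy_cost(int tile) {
    double overflow_penalty = (tile > 32) ? 1000.0 : 0.0;
    return overflow_penalty + 100.0 / tile;
}

int main(void) {
    const int bound = 96;          /* problem loop bound (hypothetical) */
    int best_tile = 1;
    double best_cost = toy_cost(1);

    for (int tile = 1; tile <= bound; ++tile) {
        if (bound % tile != 0) continue;   /* only perfect factorizations */
        double cost = toy_cost(tile);
        if (cost < best_cost) { best_cost = cost; best_tile = tile; }
    }
    printf("best tile factor: %d (cost %.2f)\n", best_tile, best_cost);
    return 0;
}
```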
All of the architectures you mentioned have their own mappers that were tailor-made for that architecture. We know this for a fact for Eyeriss and Simba, and it is an educated guess for other architectures. Unlike these architecture-specific mappers, Timeloop provides a generic mapper that works for every architecture that you can describe in the infrastructure.
In the validation experiments vs. Eyeriss in our ISPASS paper, we compared the final results from the full mapper+model stack. We did not perform a decoupled validation of the mapper and model separately, but it may be an interesting experiment - provided you have access to the architecture-specific mappers in question.
@angshuman-parashar Thanks for the great explanation! That makes a lot of sense. A few quick follow-up questions regarding your answers:
@angshuman-parashar To be clear, by compiler optimizations I mean loop-level transformations (like factorization and permutation) rather than operator fusion. My understanding is that a compiler optimization just reorganizes the data in software to fit the existing hardware, and it does not change the way data flows in hardware. A dataflow optimization does something similar, but in hardware, and it does change the way data moves in hardware. Is that correct?
Think of the dataflow as the complete spatial/temporal loop nest. Some of it (e.g., the innermost levels) is baked in hardware, while the remainder is exposed to software (i.e., the mapper) so as to adapt to different workloads. The former is expressed as constraints in Timeloop's language, while the space of options in the latter is explored by the mapper.
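As a rough illustration (a sketch with hypothetical sizes, not any real architecture's dataflow), a mapping for a matrix multiply onto a 16x16 PE array might look like the loop nest below: the innermost spatial loops are fixed by the hardware, while the bounds and order of the outer temporal loops are left to the mapper.

```c
/* A mapping viewed as one complete loop nest for C[M][N] += A[M][K] * B[K][N]
 * on a hypothetical 16x16 PE array. The two innermost "spatial" loops model
 * lanes that are baked into hardware; the outer "temporal" loops (their tile
 * bounds M1/N1/K1 and their order) are what the mapper is free to choose. */
#include <stddef.h>

#define M0 16  /* fixed by the PE array: rows of PEs    */
#define N0 16  /* fixed by the PE array: columns of PEs */

void matmul_mapped(size_t M1, size_t N1, size_t K1,   /* mapper-chosen loop bounds */
                   float *A, float *B, float *C)      /* A: M x K, B: K x N, C: M x N */
{
    size_t N = N1 * N0, K = K1;
    /* Temporal loops: the mapper may permute m1/n1/k or split them further. */
    for (size_t m1 = 0; m1 < M1; ++m1)
        for (size_t n1 = 0; n1 < N1; ++n1)
            for (size_t k = 0; k < K1; ++k)
                /* Spatial loops: conceptually unrolled across the PE array,
                 * so their order and bounds are hardware constraints. */
                for (size_t m0 = 0; m0 < M0; ++m0)
                    for (size_t n0 = 0; n0 < N0; ++n0) {
                        size_t m = m1 * M0 + m0, n = n1 * N0 + n0;
                        C[m * N + n] += A[m * K + k] * B[k * N + n];
                    }
}
```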
Timeloop's mapper component and TVM have similar objectives - how to take an unmapped workload and map it onto a specific hardware architecture. Beyond this high-level objective, there are many differences. I won't comment on TVM, but if you wish to compare the infrastructures you should ask questions about: (a) the model - the set of architectures it supports, the cost metrics it can evaluate, the ability to perform architecture design-space exploration, the approach used to describe dataflow constraints, etc. (b) the mapper - the set of transformations it supports, the search heuristics, the range of workloads you can describe, the analysis approach, and the optimization parameters that can be used as feedback from the model.
@angshuman-parashar Thanks! So if I understand correctly, the definition of dataflow can span across both hardware (in the form of constraints in Timeloop) and software (which are essentially compiler optimizations on loops). Is that correct?
That's correct. I should point out - I believe Chen, Sze and Emer (the original Eyeriss authors) now use an even stricter definition of the term "dataflow" to refer to only loop-nest order (i.e., permutation) and not loop bounds. This is orthogonal to the hardware/software split. I can try to find a reference, but the bottom line is - while perusing work in this area, keep in mind that authors may be using slightly different interpretations of the term "dataflow". Within the Timeloop context we prefer to use the terms "mapping" and "mapspace constraints" because of the lower risk of aliasing vs. conflicting definitions.
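For example (hypothetical sizes), the two loop nests below share the same permutation (m, then n, then k) but use different loop bounds. Under the stricter usage they are a single "dataflow"; in Timeloop terms they are simply two different mappings drawn from the same mapspace.

```c
/* Illustration only: same loop order (permutation m -> n -> k), different
 * loop bounds. Each nest computes a different-shaped tile of a matrix
 * multiply over hypothetical, statically sized operands. */
void same_permutation_different_factors(void)
{
    static float a[32][16], b[16][32], c[32][32];

    /* Mapping 1: bounds M=8, N=32, K=4. */
    for (int m = 0; m < 8; ++m)
        for (int n = 0; n < 32; ++n)
            for (int k = 0; k < 4; ++k)
                c[m][n] += a[m][k] * b[k][n];

    /* Mapping 2: same loop order, bounds M=32, N=8, K=16. */
    for (int m = 0; m < 32; ++m)
        for (int n = 0; n < 8; ++n)
            for (int k = 0; k < 16; ++k)
                c[m][n] += a[m][k] * b[k][n];
}
```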
@angshuman-parashar That makes a lot of sense. Maybe I'll try to summarize the terminology somewhere (and point to this thread) when I understand them better.
Another question (which should be the final one) is about the distinction between hardware and software optimizations. Is there a clear dividing line between hardware and software (or between mapping constraints and the mapping)? For example, are there reasons why some things have to be baked into hardware while others are flexible enough to sit on either side?
Thanks for being so responsive. I really appreciate it, and it makes me much more comfortable working with this great tool!
Zhan
There's no single answer. Hardware flexibility comes at a cost - programmable state machines (allowing for loop interchanges), routed networks and/or mux trees (allowing for different spatial mappings) have area and energy costs. This cost is often acceptable at the outer levels of an architecture's hierarchy (where units are already larger and more expensive, and accesses are hopefully fewer thanks to reuse filtering from inner levels), but unacceptable at the innermost levels where frequent accesses and highly-replicated small units may mandate lean, inflexible designs. Exactly where you draw the line depends on a large number of factors, and different architectures make different choices.
Thanks for all the detailed explanation, and I'll close this issue!
Hi there,
I do see the architectures and a set of constraints for eyeriss, simba and chen-asplos2014, but how could I get the default dataflows without a mapper? Basically, in addition to the dataflows found by Timeloop's search model, I'm trying to compare against the original dataflows of those accelerators as baselines, given different problems / layers. I assume the search model is definitely not reproducing their results, right?
Another related question: without a mapper, how do these accelerators pick a valid and efficient dataflow for each problem / layer? I assume the dataflow changes with the problem, at least the loop factors.
Thanks!!!