Xilinx / finn

Dataflow compiler for QNN inference on FPGAs
https://xilinx.github.io/finn
BSD 3-Clause "New" or "Revised" License

New Folding Optimizer #1187

Open lstasytis opened 2 months ago

lstasytis commented 2 months ago

A large design challenge in FINN is deciding on the correct amount of parallelism to set for each individual node in the dataflow graph (folding) to maximize throughput while minimizing resource consumption.

The current solution to folding in FINN is a simple transformation (SetFolding()) which traverses the dataflow graph node-by-node and increases each parameter responsible for parallelism incrementally until the given throughput target is met.
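For reference, a toy sketch of this greedy strategy (stand-in names and cost model, not the actual transformation code):

```python
from dataclasses import dataclass

@dataclass
class Node:
    legal_factors: list  # ascending divisors of the bounding parameter
    work: int            # total operations the node must perform
    folding: int = 1

    def estimated_cycles(self):
        return self.work // self.folding

def set_folding_greedy(nodes, target_cycles):
    # walk the graph node by node, raising each node's parallelism
    # until its local cycle estimate meets the throughput target
    for node in nodes:
        for factor in node.legal_factors:
            node.folding = factor
            if node.estimated_cycles() <= target_cycles:
                break

nodes = [Node([1, 2, 5, 10], 300_000), Node([1, 3, 9], 270_000)]
set_folding_greedy(nodes, target_cycles=50_000)
print([n.folding for n in nodes])  # [10, 9]
```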

However, this solution overlooks the many implications folding has on resource consumption: two configurations that achieve near-identical throughput can have vastly different resource footprints.

For example, a node with two degrees of parallelism could be folded as either (X,Y) or (Y,X), where X and Y are two possible parameter choices; both variants yield the same throughput for the node itself. However, the choice has several knock-on effects:

- The input and output stream widths differ (the input width equals the datawidth times X or Y, and likewise for the output). Stream widths and transaction processing behavior in turn determine which Data Width Converters (DWCs) and buffers (FIFOs) get inserted between nodes: if a node's output stream is folded to the same degree as the input stream of the subsequent node and the node data types are identical, no DWC is inserted.
- Between two inverse folding choices for a dual-parameter node, the resource consumption of the node itself can differ, depending on which degree of parallelism has the higher resource cost associated with it.
- When searching for the optimal throughput achievable within a given resource budget, some nodes may need to be folded in a highly specific way to achieve the throughput necessary to not be a bottleneck while also not over-provisioning resources. For example, if we wanted to speed up a node by 6x and our folding factor options were [1,3,7] and [1,2,4], the optimal configuration would be the pair (3,2). Any other pair multiplies to less or more than 6, making the node either a bottleneck or more expensive than necessary (see the sketch below).
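To make that last point concrete, a minimal sketch of the exact-match search (hypothetical names, not FINN API):

```python
# The only pair of folding factors whose product hits the 6x target exactly:
simd_options = [1, 3, 7]
pe_options = [1, 2, 4]
target_speedup = 6

matches = [(s, p) for s in simd_options for p in pe_options
           if s * p == target_speedup]
print(matches)  # [(3, 2)]
```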

In light of this, this PR introduces a new SetFolding() transformation that accounts for these effects and produces much more robust results, using FINN's resource estimate functions to capture the impact of folding on the dataflow graph without having to introduce complex rulesets.

The new transformation treats the problem of folding as a global optimization problem with bounded integer variables. By generalizing the problem in such a way, we are able to make use of robust minimization algorithms to find optimal or near-optimal solutions while incrementally introducing any necessary heuristics to improve the optimization outcome.

The current minimizer of choice is dual_annealing from scipy.optimize, an implementation of generalized simulated annealing.

The cost function enforces a hard bound on the throughput target while increasing the cost for each resource used (LUT, BRAM, URAM, DSP) in proportion to the selected platform's capacity.
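As a rough, self-contained sketch of how such a cost function can be wired into scipy's dual_annealing (the estimators below are toy stand-ins, not FINN's resource estimate functions):

```python
from scipy.optimize import dual_annealing

LIMITS = {"LUT": 100_000, "BRAM": 500, "URAM": 100, "DSP": 1_000}
TARGET_CYCLES = 50_000

def estimate_cycles(folding):
    # toy model: more parallelism -> fewer cycles per node
    return max(300_000 // f for f in folding)

def estimate_resources(folding):
    # toy model: resources grow linearly with total parallelism
    s = sum(folding)
    return {"LUT": 2_000 * s, "BRAM": 10 * s, "URAM": 0, "DSP": 5 * s}

def cost(x):
    # dual_annealing searches over continuous variables, so round each
    # one back to a bounded integer folding factor
    folding = [max(1, int(round(v))) for v in x]
    if estimate_cycles(folding) > TARGET_CYCLES:
        return 1e9  # hard bound: reject configs that miss the throughput target
    res = estimate_resources(folding)
    # each resource contributes to the cost in proportion to platform capacity
    return sum(res[r] / LIMITS[r] for r in LIMITS)

bounds = [(1, 64), (1, 64)]  # one bounded folding variable per node parameter
result = dual_annealing(cost, bounds, maxiter=100, maxfun=2_000,
                        accept=-5.0, visit=2.62, seed=0)
print([int(round(v)) for v in result.x])
```

The accept, visit, maxfun, and maxiter values here are placeholders; the transformation's own empirically chosen values are discussed under Architecture below.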

Because simulated annealing is time-consuming to run (it re-computes the cost function many times), we introduce a divide-and-conquer-style breakdown of the problem. More specifically, using a parameter exposed to the user (max_parameters_per_partition), we perform the optimization on small subsets of the overarching model's nodes one by one, using local resource consumption estimates for the relevant nodes. Once all partitions are optimized, we apply the chosen folding parameters all at once on the full model and compute the total resource consumption and throughput. Depending on whether the resulting resource consumption is within the constraints and whether the throughput target was reached, we either relax or tighten the throughput requirements and repeat the local optimizations until we arrive at a globally optimal or near-optimal solution. This way, the runtime scales linearly with model size.
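A compact sketch of that outer loop (again with toy estimators and a stand-in per-partition optimizer, not the PR's code):

```python
def optimize_partition(part, target_cycles):
    # stand-in for one dual_annealing run over a small parameter subset:
    # pick the smallest folding factor that locally meets the cycle target
    return [next(f for f in range(1, 65) if 300_000 // f <= target_cycles)
            for _ in part]

def evaluate_full_model(folding):
    # toy global estimates: (LUTs, cycles)
    return 2_000 * sum(folding), max(300_000 // f for f in folding)

def optimize_model(params, target_cycles, lut_limit,
                   max_parameters_per_partition=4, rounds=8):
    best = None
    for _ in range(rounds):
        folding = []
        # optimize each small partition independently, using local estimates
        for i in range(0, len(params), max_parameters_per_partition):
            folding += optimize_partition(
                params[i:i + max_parameters_per_partition], target_cycles)
        # apply all partitions at once, then check the global result
        luts, cycles = evaluate_full_model(folding)
        if luts <= lut_limit and cycles <= target_cycles:
            best = folding
            target_cycles = int(target_cycles * 0.9)  # tighten: seek more throughput
        else:
            target_cycles = int(target_cycles * 1.1)  # relax: ease the throughput bound
    return best

print(optimize_model(list(range(10)), target_cycles=50_000, lut_limit=150_000))
```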

The transformation now features a large set of additional parameters which control how the optimization is performed. The most notable ones are described below.

On padding:

One of the most powerful design introductions in the new folding optimizer is the use of channel dimension padding to relax folding factor constraints. Currently in FINN, a folding factor choice must be a divisor of the bounding parameter (usually the channel size), since FINN nodes cannot inherently deal with going out of bounds in their SIMD operations. During the new folding optimizer runs, an assumption can be made that node channel dimensions may be padded by some number of elements from a list. If the optimizer chooses a folding factor that is only achievable through a given padding, that padding is applied to the node's tensor shapes and bounding parameter node attribute value alongside the actual folding parameter.
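A small standalone illustration of why padding relaxes the constraint (hypothetical helper, not the PR's code):

```python
def legal_simd_values(channels, max_padding=0):
    # collect every SIMD value that divides the (possibly padded) channel
    # count, remembering the smallest padding that makes it legal
    legal = {}
    for pad in range(max_padding + 1):
        for simd in range(1, channels + pad + 1):
            if (channels + pad) % simd == 0:
                legal.setdefault(simd, pad)
    return legal

print(legal_simd_values(10))                 # {1: 0, 2: 0, 5: 0, 10: 0}
print(legal_simd_values(10, max_padding=2))  # adds 3, 4, 6, 11, 12 via padding
```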

However, functional padding support is only possible if the generalized DWC from the branch https://github.com/lstasytis/finn/tree/feature/generalized-datawidthconverter is also merged into the repository. Without this DWC performing the necessary padding, setting folding_maximum_padding above 0 will result in faulty models, so it should not be used.

Architecture:

The minimizer function in the folding optimizer currently uses hyperparameters (accept, visit, maxfun, maxiter) chosen empirically during the initial design passes on the transformation. Tuning these parameters further should lead to better solutions and reduced runtime. ram_style and resType are currently disabled as 'folding' parameters to optimize; however, they would be useful, especially when deciding on the correct memory type to use for weight storage.