A large design challenge in FINN is deciding on the correct amount of parallelism to set for each individual node in the dataflow graph (folding) to maximize throughput while minimizing resource consumption.
The current solution to folding in FINN is a simple transformation (SetFolding()) which traverses the dataflow graph node-by-node and increases each parameter responsible for parallelism incrementally until the given throughput target is met.
However, this solution overlooks the many implications folding has for resource consumption: two configurations with near-identical throughput can have vastly different resource footprints.
For example, a node with two degrees of parallelism can be folded as either (X,Y) or (Y,X), where X and Y are two possible parameter choices; both variants yield the same throughput for the node itself. However, they produce different input and output stream widths (the input width being the datatype width times X or Y, and likewise for the output). These stream widths, along with the node's transaction processing behavior, determine which Data Width Converters (DWCs) and buffers (FIFOs) get inserted between nodes: if a node's output stream is folded to the same degree as the subsequent node's input stream and the data types are identical, no DWC is inserted at all. Furthermore, between the two inverse folding choices for a dual-parameter node, the resource consumption of the node itself can differ, depending on which degree of parallelism carries the higher resource cost. Lastly, when searching for the best throughput achievable within a given resource budget, some nodes may need to be folded in a highly specific way to avoid becoming a bottleneck while also not over-provisioning resources. For example, to speed up a node by 6x with folding factor options [1,3,7] and [1,2,4], the optimal configuration is the pair (3,2); every other pair multiplies to less or more than 6, making the node either a bottleneck or more expensive than necessary.
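To make that last point concrete, the snippet below enumerates the feasible pairs and picks the cheapest one. The factor lists and 6x target are taken from the example above; the brute-force selection is purely illustrative and is not how the actual optimizer searches:

```python
from itertools import product

# Illustrative only: pick the cheapest folding pair that still meets a
# required speedup. Factor lists and the 6x target are from the example above.
target_speedup = 6
pe_options = [1, 3, 7]
simd_options = [1, 2, 4]

# Keep only pairs whose combined parallelism avoids a bottleneck ...
feasible = [(pe, simd) for pe, simd in product(pe_options, simd_options)
            if pe * simd >= target_speedup]
# ... then take the one that over-provisions the least.
best = min(feasible, key=lambda p: p[0] * p[1])
print(best)  # (3, 2): exactly 6x, no wasted resources
```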
In light of this, this PR introduces a new SetFolding() transformation which accounts for these implications and produces a much more robust result. It does so by leveraging FINN's resource estimate functions to capture the effects of folding on the dataflow graph, without having to introduce complex rulesets.
The new transformation treats folding as a global optimization problem over bounded integer variables. Framed this way, the problem becomes amenable to robust minimization algorithms that find optimal or near-optimal solutions, while any necessary heuristics can be introduced incrementally to improve the optimization outcome.
The current minimizer of choice is dual_annealing, an implementation of (generalized) simulated annealing from scipy's library of minimizers.
The cost function enforces a hard bound on the throughput target, while the cost contributed by each resource used (LUT, BRAM, URAM, DSP) grows in proportion to the selected platform's capacity.
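A minimal sketch of this formulation is shown below, with toy resource and throughput models standing in for FINN's estimate functions; the capacities, option lists, and penalty value are illustrative assumptions, not the PR's actual code:

```python
from scipy.optimize import dual_annealing

# Toy setup: each variable is an index into one parameter's list of legal
# folding factors; the cost is the fraction of platform capacity consumed,
# plus a heavy penalty if the throughput target is missed.
folding_options = [[1, 3, 7], [1, 2, 4]]   # legal factors per folding parameter
capacity = {"LUT": 230400, "BRAM": 312, "URAM": 96, "DSP": 1728}
throughput_target = 6.0

def estimate_resources(factors):
    # toy model: resource use grows with the product of the folding factors
    scale = factors[0] * factors[1]
    return {"LUT": 500 * scale, "BRAM": 2 * scale, "URAM": 0, "DSP": 4 * scale}

def cost(x):
    # dual_annealing explores continuous values, so round each variable to
    # the nearest index into that parameter's list of legal folding factors
    factors = [opts[int(round(xi))] for xi, opts in zip(x, folding_options)]
    usage = estimate_resources(factors)
    c = sum(usage[r] / capacity[r] for r in capacity)  # cost ~ capacity used
    if factors[0] * factors[1] < throughput_target:    # hard throughput bound
        c += 1e6                                       # heavy penalty term
    return c

bounds = [(0, len(opts) - 1) for opts in folding_options]
result = dual_annealing(cost, bounds, maxfun=2000)
best = [opts[int(round(xi))] for xi, opts in zip(result.x, folding_options)]
print(best)  # expected: [3, 2]
```

Rounding the continuous values that dual_annealing explores to indices into each parameter's list of legal factors is what turns the continuous annealer into a bounded-integer search.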
Because simulated annealing is time-consuming to run (it recomputes the cost function many times), we introduce a divide-and-conquer-style breakdown of the problem. More specifically, using a parameter exposed to the user (max_parameters_per_partition), we optimize small subsets of the model's nodes one partition at a time, using local resource consumption estimates for the relevant nodes. Once all partitions are optimized, we apply the chosen folding parameters to the full model at once and compute the total resource consumption and throughput. Depending on whether the resulting resource consumption is within the constraints and the throughput target was reached, we either relax or tighten the throughput requirement and repeat the local optimizations until we arrive at a globally optimal or near-optimal solution. This way, the runtime scales linearly with model size.
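Schematically, the global loop looks like the following, assuming a hypothetical optimize_all_partitions() helper that encapsulates the partitioned local optimizations, applies the results to the full model, and reports whether the combined estimates are acceptable at the requested target:

```python
# Schematic only: optimize_all_partitions() is a stand-in for the partitioned
# optimization pass (partition parameters, minimize each partition locally,
# apply all choices, re-estimate resources/throughput on the full model).
def optimize_all_partitions(model, target):
    raise NotImplementedError

def run_folding(model, target, max_attempts, throughput_upper_bound):
    config, feasible = optimize_all_partitions(model, target)
    # attempt 1 simply tries to hit the requested target; further attempts
    # binary-search for the highest throughput the resource budget allows
    lo, hi = (target, throughput_upper_bound) if feasible else (0.0, target)
    for _ in range(max_attempts - 1):
        mid = (lo + hi) / 2
        trial, ok = optimize_all_partitions(model, mid)
        if ok:
            config, lo = trial, mid   # feasible: tighten the target upward
        else:
            hi = mid                  # infeasible: relax the target
    return config
```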
The transformation now features a large set of additional parameters which control how it is performed. The most notable ones are listed below, followed by a usage sketch:
folding_style = ["naive", "optimizer"] (either use the old simple automatic folding routine (extremely fast) or the new optimizer-based routine (slower, but far more resource-efficient))
folding_maximum_padding = [int] (the maximum number of elements by which channel dimensions may be padded to relax folding factor constraints on the optimizer; more on this below)
folding_pad_io_nodes = [bool] (whether to pad the IO nodes; if folding_maximum_padding is set to more than 0, this can result in IO node shapes changing, which would require modifications to the host code (padding of the input and cropping of the output))
enable_folding_dwc_heuristic = [bool] (whether to account for the DWCs that folding may introduce, both in the optimizer's cost function and in the final resource consumption computed in the global loop of the transformation)
enable_folding_fifo_heuristic = [bool] (whether to insert FIFOs once during each global pass on the model when computing the final resource consumption)
folding_effort = [int] (how many cost function calls the minimizer is allowed during folding)
folding_max_attempts = [int] (how many folding attempts to run in sequence. A value of 1 means only one attempt is made to hit the specified throughput target. A value greater than 1 means that after the first attempt, the optimizer tries to maximize throughput up to the resource limit, using a binary search to zero in on the highest throughput which still meets the resource constraints)
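A usage sketch with the parameters above; the import path matches FINN's existing SetFolding transformation, but the exact constructor signature and the values shown are assumptions based on this list, not necessarily the final API:

```python
from finn.transformation.fpgadataflow.set_folding import SetFolding

# Hypothetical invocation; parameter values are illustrative.
model = model.transform(
    SetFolding(
        target_cycles_per_frame=1000,        # existing throughput target knob
        folding_style="optimizer",           # use the new optimizer-based routine
        folding_maximum_padding=6,           # allow up to 6 padded channel elements
        folding_pad_io_nodes=False,          # keep IO shapes host-code compatible
        enable_folding_dwc_heuristic=True,   # account for DWC insertion costs
        enable_folding_fifo_heuristic=False, # skip FIFO insertion per global pass
        folding_effort=250,                  # minimizer cost-function call budget
        folding_max_attempts=4,              # binary-search for max throughput
    )
)
```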
On padding:
One of the most powerful design introductions in the new folding optimizer is the use of channel dimension padding to relax folding factor constraints. More specifically, FINN currently requires a folding factor choice to be a divisor of the bounding parameter (usually the channel size), since FINN nodes cannot inherently deal with going out of bounds in their SIMD operations. During the new folding optimizer runs, we can assume that node channel dimensions may be padded by some number of elements from a list. If the optimizer chooses a folding factor which is only achievable with a given padding, that padding is applied to the node's tensor shapes and bounding parameter node attribute alongside the actual folding parameter.
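As a small illustration of why padding helps, consider a 96-channel node: without padding, only the divisors of 96 are legal folding factors, while allowing up to 4 padded elements also unlocks the divisors of 97 through 100 (the numbers here are made up for the example):

```python
# Illustrative only: how padding widens the set of legal folding factors.
channels = 96
max_padding = 4

def legal_factors(bound):
    # folding factors must divide the bounding parameter exactly
    return {f for f in range(1, bound + 1) if bound % f == 0}

unpadded = legal_factors(channels)
padded = set().union(*(legal_factors(channels + p) for p in range(max_padding + 1)))
print(sorted(padded - unpadded))
# factors reachable only through padding, e.g. 5, 7, 9, 10, 11, 14, 20, 25, ...
```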
However, functional padding support is only possible if the generalized DWC from the branch https://github.com/lstasytis/finn/tree/feature/generalized-datawidthconverter is also merged into the repository. Without this DWC performing the necessary padding, folding_maximum_padding values above 0 will result in faulty models and should not be used.
Architecture:
SetFolding - top-level class, which calls either the naive folding algorithm or the new folding optimizer. The outer loop which tries to maximize throughput lives in this class, calling the Optimizer class multiple times.
Optimizer - the main optimizer class, which contains the functions for collecting parameters from the model and for performing the actual optimization.
cost_model() - cost model function, which calls the resource consumption estimate functions and computes the final cost, invoking additional heuristics along the way.
execute_minimizer() - minimizer wrapper which passes the parameters to the scipy minimization function (dual_annealing by default)
optimize() - main optimization loop, which partitions a set of parameters and calls execute_minimizer() + cost_model() for each partition, while also applying the chosen folding and padding parameter changes to the model.
generate_parameter_set() - collects all the parameters from the model, adhering to the folding and padding whitelist and any additional rules. Most notably, ConvolutionInputGenerator -> Pool (optional) -> MVAU/VVAU node chains are reduced to a single meta parameter, propagating the ConvolutionInputGenerator's channel dimension padding to construct larger feature maps for the MVAU/VVAU while retaining functional correctness.
parameter_whitelist() - a dictionary of all folding parameters and nodes which the folding optimizer may fold and/or pad. This whitelist is used during parameter collection and when constructing the viable folding value sets for each parameter.
Parameter - class holding an individual foldable parameter. The class holds references to the relevant nodes, the model itself and the bounding parameter, so that all of them can be updated depending on the folding choice made (primarily relevant for padding).
MetaParameter - class holding a set of Parameter class objects which are tightly linked together (this is only relevant for node chains making up convolutions). In most cases, a MetaParameter just holds a single Parameter object.
ParameterSet - a class holding a set of MetaParameters in a single partition for folding optimization (see the skeleton below).
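For reference, the relationship between these three classes can be sketched as follows; the attribute names are inferred from the descriptions above, not taken from the actual code:

```python
from dataclasses import dataclass, field

# Structural skeleton of the parameter classes described above (illustrative).
@dataclass
class Parameter:
    node: object            # node whose folding attribute this controls
    model: object           # back-reference for shape/attribute updates
    name: str               # e.g. "SIMD" or "PE"
    bounding_value: int     # value the folding factor must divide (pre-padding)
    possible_values: list   # viable folding factors, incl. padded variants

@dataclass
class MetaParameter:
    # tightly linked Parameters (e.g. a ConvolutionInputGenerator -> MVAU
    # chain); in most cases this holds just a single Parameter
    parameters: list

@dataclass
class ParameterSet:
    # one partition's worth of MetaParameters, optimized in a single
    # minimizer run
    meta_parameters: list = field(default_factory=list)
```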
Improvements to be made:
The minimizer function in the folding optimizer currently uses hyperparameters (accept, visit, maxfun, maxiter) chosen empirically during the initial design passes on the transformation. Tuning these parameters should lead to more optimal solutions and a reduction in runtime.
ram_style and resType are currently disabled as 'folding' parameters to optimize; however, they would be useful, especially when it comes to deciding on the correct memory type to use for weight storage.