cornell-zhang / heterocl

HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Heterogeneous Computing
https://cornell-zhang.github.io/heterocl/
Apache License 2.0

Enhance the graph partitioning algorithm to determine host-device bound with incomplete placement information #319

Open hecmay opened 3 years ago

hecmay commented 3 years ago

In the prototype, we created a naive graph partitioning algorithm to separate the DFG into a host scope and a device scope. The naive algorithm only supports generating a single kernel function, and assumes all the boundary edges of the subgraph (kernel scope) have been specified with the .to() primitive.

This assumption is too strong for realistic designs: 1) the optimal placement strategy is not known at compile time, and 2) it raises the barrier for trying different placement schemes, especially for designs with large numbers of inputs/outputs.

We need an adaptive graph partitioning algorithm that addresses the following needs: it should find a valid partitioning scheme that satisfies various constraints, including those provided by users as well as physical constraints (e.g., the on-chip area consumption cannot exceed the budget). We also expect it to be fast enough that it does not introduce much compile-time overhead.

Traditional graph partitioning algorithms try to find a balanced k-way partition with minimum cut size. This is a bit different from our goal: we do not need the partition to be balanced -- it is okay to offload all ops to the FPGA if the application fits on-chip -- but the cut size should still be as small as possible.
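For concreteness, the cut size of a given host/device labeling can be computed directly from the DFG edge list. This is a plain-Python sketch with illustrative names, not HeteroCL code:

```python
def cut_size(edges, placement):
    """Count DFG edges crossing the host/device boundary.

    edges: iterable of (u, v) dataflow dependencies
    placement: dict mapping each node to "host" or "device"
    """
    return sum(1 for u, v in edges if placement[u] != placement[v])

# Toy DFG: in -> a -> b -> out, with a and b offloaded to the device.
edges = [("in", "a"), ("a", "b"), ("b", "out")]
placement = {"in": "host", "a": "device", "b": "device", "out": "host"}
print(cut_size(edges, placement))  # 2: the in->a and b->out edges cross
```

Each crossing edge corresponds to a host-device data transfer, which is why minimizing the cut matters even without a balance constraint.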

A variation of the classic local improvement algorithms (e.g., Kernighan-Lin) should be able to address our needs: https://ep.liu.se/ea/cis/1998/010/cis98010.pdf. I am not sure this will work, but it is worth trying. In the paper, the cost model is defined as g(v) = ext(v) - int(v); it measures how much gain we get by moving a node from its current partition group to the other.

The formula is defined in the paper on page 4 and is quite intuitive, so I will not spend much effort explaining it here. We need to redefine g(v) to reflect our needs, as something like: g(v) = reduced_area_if_moving + reduced_cut_size_if_moving - potential_performance_loss
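A sketch of both gain functions in plain Python. The area, traffic, and performance terms are stand-ins for whatever cost model we end up with; all callback names and weights below are hypothetical:

```python
def classic_gain(v, adj, placement):
    """KL-style gain g(v) = ext(v) - int(v): moving v to the other
    partition removes ext(v) cut edges and turns int(v) internal
    edges into new cut edges."""
    ext = sum(1 for u in adj[v] if placement[u] != placement[v])
    int_ = sum(1 for u in adj[v] if placement[u] == placement[v])
    return ext - int_

def hcl_gain(v, adj, placement, area, edge_bytes, perf_loss,
             w_area=1.0, w_cut=1.0):
    """Proposed gain for moving device node v to the host:
    freed on-chip area + reduced boundary traffic - performance loss.
    area, edge_bytes, perf_loss are hypothetical cost-model callbacks."""
    # Edges to the other partition become internal (traffic saved);
    # edges within the same partition become new cut edges (traffic added).
    delta_cut = sum(edge_bytes(v, u) * (1 if placement[u] != placement[v] else -1)
                    for u in adj[v])
    return w_area * area(v) + w_cut * delta_cut - perf_loss(v)
```

The weights w_area and w_cut would need tuning; a node with g(v) > 0 is a profitable move under this model.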

Here is a draft of what I am thinking... please provide feedback if possible:

  1. Constraints: all input & output nodes (i.e. ops) are placed on the host, and all other nodes on the device by default, along with any user-specified constraints.
  2. Predict the area consumption of each node (i.e. op), and check whether the area limit is satisfied.
  3. If not, iterate over the neighboring nodes {Vi} of the nodes marked as host. If g(vi) > 0 -- meaning we can save significant on-chip resources or reduce the data transfer size without losing too much overall performance -- move that node from the device scope to the host scope.
  4. Repeat step 3 until the area requirement is met.
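The steps above could look roughly like the following greedy loop. This is a pure-Python sketch, not HeteroCL code; `area_of` and the `gain` callback stand in for the real cost model:

```python
def refine_partition(nodes, adj, io_nodes, area_of, area_budget, gain):
    """Greedy host/device partition refinement.

    Step 1: I/O nodes start on the host, everything else on the device.
    Steps 2-4: while the device-side area estimate exceeds the budget,
    move the boundary device node with the best positive gain to the host.
    Returns the placement dict, or None if no feasible move remains.
    """
    placement = {v: ("host" if v in io_nodes else "device") for v in nodes}
    while sum(area_of(v) for v in nodes if placement[v] == "device") > area_budget:
        # Candidates: device nodes adjacent to at least one host node.
        boundary = [v for v in nodes
                    if placement[v] == "device"
                    and any(placement[u] == "host" for u in adj[v])]
        moves = [(gain(v, adj, placement), v) for v in boundary]
        moves = [(g, v) for g, v in moves if g > 0]
        if not moves:
            return None  # stuck: no beneficial move, budget unmet
        _, best = max(moves)
        placement[best] = "host"
    return placement
```

One open question with this greedy scheme is that it can get stuck (the `None` case) when every remaining move has g(v) <= 0; a KL-style pass that tentatively accepts negative-gain moves and keeps the best prefix might avoid that.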
zhangzhiru commented 3 years ago

Is the basic user-directed partitioning method working reliably now? Auto SW/HW partitioning is a very hard problem in general because of the difficulty of predicting on-FPGA area and performance. We should optimize the basic one first.

hecmay commented 3 years ago

The basic partitioning is working. We have around 5 test cases for it. Yeah, we can add more test cases first to ensure it is robust enough.