Ravenslofty / blog

The Critical Path - a rambly FPGA blog
49 stars 2 forks source link

26/10/2020: A LUT Mapper (II). #4

Open Ravenslofty opened 4 years ago

Ravenslofty commented 4 years ago

LUT Mapping with Priority Cuts (II)

Some snippets are going to be in diff format, because I find that helps show the changes more. On the other hand, syntax highlighting dies.

The mapper works, but the results are mediocre. Let's fix that.

Depth-optimality

If you locally pick the lowest-depth solution at each node, the global solution found by the cut selector will also be the lowest-depth solution. The depth of a cut is (for now) the maximum of the depth of its inputs plus one level.

/// An AND-Inverter Graph.
/// Populating this graph is out of scope for this article, but see the AIGER format if you want more.
struct Aig {
    /// The primary inputs of the graph.
    inputs: Vec<u32>,
    /// The primary outputs of the graph.
    outputs: Vec<u32>,
    /// The graph nodes.
    nodes: Vec<AigNode>,
+   /// The depth of the best cut at each node.
+   depth: Vec<i32>,
}

impl Aig {
+   /// Returns the depth of a cut.
+   pub fn calculate_depth(&self, cut: &Cut) -> i32 {
+       cut.inputs.iter().map(|node| self.depth[node as usize]).max().unwrap() + 1
+   }
+
    /// Return cuts in the graph that have at most `max_inputs` inputs.
    pub fn enumerate_cuts(&self, max_inputs: usize) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
+       // The depth of a primary input is zero.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
+           self.depth[*input as usize] = 0;
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
+           // Sort them by depth.
+           .sorted_by(|lhs, rhs| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)))
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();
+
+           // If you somehow have zero cuts at a node there is a bug, because the trivial cut guarantees at least one input cut.
+           assert!(!node_cuts.is_empty());
+
+           let best_cut = &node_cuts[0];
+           // Set the depth of the best cut at this node.
+           depth[node as usize] = self.calculate_depth(best_cut);

            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;          
        }

        // Return the set of cuts for the graph.
        cuts
    }
}

Since we picked the first element in the cut selector, this will now always pick a depth-optimal cut. But this also lays the foundation for other things.

Using my own implementation in mignite, I get:

Mapped to 155433 LUTs
LUT0: 0
LUT1: 0
LUT2: 15838
LUT3: 1153
LUT4: 138442
Maximum delay: 8531

This is an optimal-depth solution, but it's huge. Having secured depth-optimality, we now need to aim for reducing solution area.

Area recovery consists of two parts: picking the best area depth-optimal cut, and then later relaxing the depth where it can be achieved without affecting the overall delay.

Picking smaller depth-optimal cuts

Let's start by preferring depth-optimal cuts with fewer inputs, which reduces the number of times a node is covered by multiple LUTs.

impl Aig {
    /// Return cuts in the graph that have at most `max_inputs` inputs.
    pub fn enumerate_cuts(&self, max_inputs: usize) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
        // The depth of a primary input is zero.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
            self.depth[*input as usize] = 0;
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
-           // Sort them by depth.
-           .sorted_by(|lhs, rhs| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)))
+           // Sort them by depth, breaking ties by picking the cut with the fewest inputs.
+           .sorted_by(|lhs, rhs| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)).then_with(|| lhs.input_count().cmp(&rhs.input_count())))
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();

            // If you somehow have zero cuts at a node there is a bug, because the trivial cut guarantees at least one input cut.
            assert!(!node_cuts.is_empty());

            let best_cut = &node_cuts[0];
            // Set the depth of the best cut at this node.
            depth[node as usize] = self.calculate_depth(best_cut);

            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;          
        }

        // Return the set of cuts for the graph.
        cuts
    }
}

This helps a lot:

Mapped to 99940 LUTs
LUT0: 0
LUT1: 0
LUT2: 40962
LUT3: 738
LUT4: 58240
Maximum delay: 8531

Required time calculation

Before we implement a heuristic to reduce area, we need to know how much we can increase depth by. We do this by calculating a "required time" for that node, which is the depth of the node minus the distance from the output. You can calculate that using a post-order (or reverse topological - children before node) traversal from primary outputs to primary inputs.

/// An AND-Inverter Graph.
/// Populating this graph is out of scope for this article, but see the AIGER format if you want more.
struct Aig {
    /// The primary inputs of the graph.
    inputs: Vec<u32>,
    /// The primary outputs of the graph.
    outputs: Vec<u32>,
    /// The graph nodes.
    nodes: Vec<AigNode>,
    /// The depth of the best cut at each node.
    depth: Vec<i32>,
+   /// The distance of the node to the most critical output.
+   required: Vec<Option<i32>>,
}

impl Aig {
    /// Select the final mapping of LUTs.
    pub fn select_cuts(&self, max_inputs: usize) -> Vec<Cut> {
        // Calculate the cuts.
        let cuts = self.enumerate_cuts(max_inputs);

        // The mapping frontier is all the nodes which need a cut selected that do not presently have one.
        // We start with the primary outputs of the graph and let the algorithm do its job from there.
        let mut frontier = self.outputs.clone();

        // The selected mappings to return.
        let mut mapping = Vec::new();

        // The nodes which have a selected mapping, to avoid duplicate work for a node.
        let mut mapped_nodes = Vec::new();

        // Pick the next node from the frontier.
        while let Some(node) = frontier.pop() {
            // Pick the first cut as the one to use.
            let cut = cuts[node as usize][0];

            // Add the cut's inputs to the mapping frontier if they are not primary inputs and not already mapped.
            for input in &cut.inputs {
                if !self.inputs.contains(input) && !mapped_nodes.contains(input) {
                    frontier.push(*input);
                }
            }

            // Mark this node as mapped.
            mapped_nodes.push(node);

            // Add the cut to the mapping.
            mapping.push(cut);
        }
+
+       // Calculate the maximum depth of the graph.
+       // Assuming no logic loops, this occurs at one of the primary outputs.
+       let max_depth = self.depth.iter().max().unwrap();
+
+       // Initialise the required times.
+       for node in &mut self.required {
+           *node = Some(max_depth);
+       }
+
+       // Calculate the required time by traversing the graph in postorder, starting from primary outputs and working to primary inputs.
+       // This could be implemented through a depth-first search.
+       let postorder_ordering = self.postorder_ordering();
+       for node in postorder_ordering {
+           // Inputs to this node have a required time less than this node.
+           let required = self.required[node as usize] - 1;
+
+           // Pick the first cut as the one to use.
+           let cut = cuts[node as usize][0];
+
+           // Update the required time of the input nodes if the current required time is stricter than the previous.
+           for input in &cut.inputs {
+               self.required[input as usize] = self.required[input as usize].unwrap().min(required);
+           }
+       }

        mapping
    }
}

Area minimisation

Now, the heuristic we implement is called "area flow", and it attempts to measure the global usefulness of the logic by preferring options which have the smallest implementation area, or the most useful logic.

If this ties, we pick a cut which isn't used as much to increase its efficiency. If that ties we fall back to depth.

/// An AND-Inverter Graph.
/// Populating this graph is out of scope for this article, but see the AIGER format if you want more.
struct Aig {
    /// The primary inputs of the graph.
    inputs: Vec<u32>,
    /// The primary outputs of the graph.
    outputs: Vec<u32>,
    /// The graph nodes.
    nodes: Vec<AigNode>,
    /// The depth of the best cut at each node.
    depth: Vec<i32>,
    /// The distance of the node to the most critical output.
    required: Vec<Option<i32>>,
+   /// The area flow of a node, measuring its usefulness.
+   area_flow: Vec<f32>,
+   /// The number of users of a node.
+   fanout: Vec<u32>,
}

impl Aig {
+   /// Returns the area flow of a cut.
+   pub fn calculate_area_flow(&self, cut: &Cut) -> f32 {
+       (cut.inputs.iter().map(|node| self.area_flow[*node]).sum::<f32>() + 1.0) / (self.fanout[cut.output].max(1) as f32)
+   }
+
+   /// Returns the average input fan-out count of a cut.
+   pub fn calculate_fanin(&self, cut: &Cut) -> f32 {
+       (cut.inputs.iter().map(|node| self.references[*node] as f32).sum::<f32>() / (cut.input_count() as f32)
+   }
+
    /// Return cuts in the graph that have at most `max_inputs` inputs.
-   pub fn enumerate_cuts(&self, max_inputs: usize) -> Vec<Vec<Cut>> {
+   pub fn enumerate_cuts(&self, max_inputs: usize, relax_depth: bool) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
-       // The depth of a primary input is zero.
+       // The depth, fanout and area flow of a primary input is zero.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
            self.depth[*input as usize] = 0;
+           self.area_flow[*input as usize] = 0.0;
+           self.fanout[*input as usize] = 0;
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
+           // Also include the last-best cut, if we have one.
+           .chain(self.cuts[node.index()].first().cloned())
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
+           // Keep all cuts that meet the node's required time, if it has one.
+           .filter(|cut| self.required[cut.output].is_none() || self.calculate_depth(cut) <= self.required[cut.output].unwrap())
-           // Sort them by depth, breaking ties by picking the cut with the fewest inputs.
-           .sorted_by(|lhs, rhs| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)).then_with(|| lhs.input_count().cmp(&rhs.input_count())))
+           // In depth mode: sort them by depth, breaking ties by picking the cut with the fewest inputs, then by area flow.
+           // In area mode: sort them by area flow, breaking ties by picking the cut with the smallest average fan-in count, then by depth.
+           .sorted_by(|lhs, rhs| {
+               if relax_depth {
+                   self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(lhs)).unwrap()
+                       .then_with(|| self.calculate_fanin(lhs).partial_cmp(&self.calculate_fanin(lhs)).unwrap())
+                       .then_with(|| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)))
+               } else {
+                   self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs))
+                       .then_with(|| lhs.input_count().cmp(&rhs.input_count())).then_with(|| lhs.input_count().cmp(&rhs.input_count()))
+                       .then_with(|| self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(lhs)).unwrap())
+               }
+           })
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();

            // If you somehow have zero cuts at a node there is a bug, because the trivial cut guarantees at least one input cut.
            assert!(!node_cuts.is_empty());

            let best_cut = &node_cuts[0];
            // Set the depth of the best cut at this node.
            depth[node as usize] = self.calculate_depth(best_cut);

+           // Update the fan-out and area flow arrays.
+           for input in &best_cut.inputs {
+               self.fanout[*input as usize] += 1;
+               self.area_flow[*input as usize] += self.calculate_area_flow(self.cuts[*input as usize][0]);
+           }
+
            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;          
        }

        // Return the set of cuts for the graph.
        cuts
    }
}

Now, if you run a mapping of depth, followed by a mapping of area:

Depth:
Mapped to 98644 LUTs
LUT0: 0
LUT1: 0
LUT2: 40162
LUT3: 720
LUT4: 57762
Maximum delay: 8531

Area:
Mapped to 99815 LUTs
LUT0: 0
LUT1: 0
LUT2: 38752
LUT3: 664
LUT4: 60399
Maximum delay: 8531

While not much of an improvement here, area flow comes into its own when you include generalised delay and area models. (Part 3, maybe?)

Partial cut enumeration

The less patient of you might have noticed that for sufficiently large designs, this algorithm takes forever.

real    1m55.251s

Fortunately, just as exhaustive cut enumeration exists, partial cut enumeration exists, too; we can set a number of cuts to keep per node, which reduces the time we're spending on them.

-   pub fn enumerate_cuts(&self, max_inputs: usize, relax_depth: bool) -> Vec<Vec<Cut>> {
+   pub fn enumerate_cuts(&self, max_inputs: usize, max_cuts: usize, relax_depth: bool) -> Vec<Vec<Cut>> {
        // Mapping to 1-input LUTs doesn't make any sense.
        assert!(max_inputs >= 2);

        let mut cuts = vec![vec![]; self.nodes.len()];

        // Every primary input has exactly one cut: the trivial cut of that input.
        // The depth, fanout and area flow of a primary input is zero.
        for input in &self.inputs {
            cuts[*input as usize] = vec![Cut::new(vec![*input], *input)];
            self.depth[*input as usize] = 0;
            self.area_flow[*input as usize] = 0.0;
            self.fanout[*input as usize] = 0;
        }

        // Writing a topological ordering is also out of scope, but see the Wikipedia article on it.
        let topological_ordering = self.topological_ordering();

        for node in topological_ordering {
            // Skip primary inputs.
            if self.inputs.contains(&node) {
                continue;
            }

            // Convenience variables for the children nodes.
            let lhs = self.nodes[node as usize].lhs_index();
            let rhs = self.nodes[node as usize].rhs_index();

            // Compute the trivial cut of this node.
            // Its inputs are the (sorted) inputs of this node.
            let trivial_cut_inputs = vec![lhs, rhs];
            trivial_cut_inputs.sort();
            // Its output is the output of this node.
            let trivial_cut_output = node;
            let trivial_cut = Cut::new(trivial_cut_inputs, trivial_cut_output);

            // Now for some iterator spam.
            let node_cuts = cuts[lhs as usize].iter()
            // Combine all child cuts together.
            .cartesian_product(&cuts[rhs as usize])
            // Merge the child cuts, using the current node as a root.
            .map(|(lhs, rhs)| lhs.merge(rhs, node))
            // Include the trivial cut of this node too.
            .chain(std::iter::once(trivial_cut))
            // Also include the last-best cut, if we have one.
            .chain(self.cuts[node.index()].first().cloned())
            // Keep all `max_inputs`-feasible cuts.
            .filter(|cut| cut.input_count() <= max_inputs)
            // Keep all cuts that meet the node's required time, if it has one.
            .filter(|cut| self.required[cut.output].is_none() || self.calculate_depth(cut) <= self.required[cut.output].unwrap())
            // In depth mode: sort them by depth, breaking ties by picking the cut with the fewest inputs, then by area flow.
            // In area mode: sort them by area flow, breaking ties by picking the cut with the smallest average fan-in count, then by depth.
            .sorted_by(|lhs, rhs| {
                if relax_depth {
                    self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(lhs)).unwrap()
                        .then_with(|| self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(lhs)).unwrap())
                        .then_with(|| self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs)))
                } else {
                    self.calculate_depth(lhs).cmp(&self.calculate_depth(rhs))
                        .then_with(|| lhs.input_count().cmp(&rhs.input_count())).then_with(|| lhs.input_count().cmp(&rhs.input_count()))
                        .then_with(|| self.calculate_area_flow(lhs).partial_cmp(&self.calculate_area_flow(lhs)).unwrap())
                }
            })
+           // Keep only `max_cuts` cuts.
+           .take(max_cuts)
            // And then collect them into a vector.
            .collect::<Vec<Cut>>();

            // If you somehow have zero cuts at a node there is a bug, because the trivial cut guarantees at least one input cut.
            assert!(!node_cuts.is_empty());

            let best_cut = &node_cuts[0];
            // Set the depth of the best cut at this node.
            depth[node as usize] = self.calculate_depth(best_cut);

            // Update the fan-out and area flow arrays.
            for input in &best_cut.inputs {
                self.fanout[*input as usize] += 1;
                self.area_flow[*input as usize] += self.calculate_area_flow(self.cuts[*input as usize][0]);
            }

            // Set these cuts as the cuts at this node.
            cuts[node as usize] = node_cuts;          
        }

        // Return the set of cuts for the graph.
        cuts
    }
}

A good value for max_cuts is 8, which lets you do exhaustive enumeration where it isn't too expensive.

After that small tweak, we have:

Depth:
Mapped to 98626 LUTs
LUT0: 0
LUT1: 0
LUT2: 40218
LUT3: 734
LUT4: 57674
Maximum delay: 8532

Area:
Mapped to 100351 LUTs
LUT0: 0
LUT1: 0
LUT2: 42571
LUT3: 648
LUT4: 57132
Maximum delay: 8532

real    0m21.303s

And now we have the algorithm: you can add other heuristics to prioritise whatever you think is important.