Anders429 / word_filter

A Word Filter for filtering text.
Apache License 2.0
1 stars 0 forks source link

State Minimization #40

Closed Anders429 closed 3 years ago

Anders429 commented 3 years ago

For larger WordFilters, especially those with lots of aliases and separators, there tend to be a lot of identical nodes (for example, many return nodes that have no children). These could theoretically be merged by some kind of minimization algorithm.

Will have to be careful that this doesn't affect the algorithms to add repeated subgraphs during walking, as those rely on uniqueness of node address to ensure we aren't adding the same nodes in a loop. But I don't think it will be a problem, since we should only de-duplicate if the same inputs will result in the same outputs for both nodes.

A simple approach would be to loop through NodeGenerators and find identical ones, merging them together and updating node references to match the change. Then repeat until no identical nodes are found. However, more efficient methods may be possible using DFA Minimization algorithms.

Anders429 commented 3 years ago

The linked DFA minimization things won't directly apply here, since, after much research, it's clear that a WordFilter is best modeled as a push-down automaton (although it only needs to be modeled as a PDA for spatial reasons. It could be modeled as an NFA, with traversals to the separators repeated at every state, and then converted to a DFA with powerset conversion, but the generated code quickly becomes simply too large to reasonably compile due to the large amount of states and traversals that must be defined. A PDA is a much simpler way to model the problem).

However, minimizations can still be done. States with the exact same traversals and type can be merged into a single state, which is especially useful for merging Return states in separators and aliases. The naïve approach is to do this merging in a repeated loop until the number of states stops changing, at which point the PDA is considered "minimized".