KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0
25 stars 6 forks source link

Added New Pattern Matching Library #8

Closed jfschaefer closed 6 years ago

jfschaefer commented 7 years ago

Summary

I implemented a new pattern matching library, based on the insights from my bachelor thesis. It can be used to match phrases, words and math formulae. At the moment, the patterns are written as an XML file. To test the new library, I implemented a new declaration spotter.

The PR also adds missing support for XML namespaces to the serialization code.

The Pattern Language

The patterns are written in an XML file (as in this example). A pattern file essentially contains a list of rules, which can reference each other. I will try to provide an overview of how these rules look like. One day I might write a proper documentation.

Here is an example rule:

    <word_rule name="indefinite article">
        <meta>
            <description>
                Matches an indefinite article. Only covers lower case.
            </description>
        </meta>
        <word_or>
            <word>a</word>
            <word>an</word>
            <word>some</word>
            <word>any</word>
        </word_or>
    </word_rule>

This creates a rule for matching words. It has a name so that we can reference it later. The meta node is optional and currently does not support much metadata. Afterwards, we have the actual pattern that is matched by this rule. In this case, it is a word_or pattern, which matches a word, if any of the contained word patterns matches.

Here is a second word rule, referencing this rule:

    <word_rule name="article">
        <word_or>
            <word>the</word>
            <word>this</word>
            <word_ref ref="indefinite article" />
        </word_or>
    </word_rule>

There exist the following types of rules:

Here is a more advanced example of two math rules that match an identifier using mutual recursion:

    <math_rule name="identifier">
        <math_or>
            <math_node name="mi">
                <mtext_ref ref="identifier symbol" />
            </math_node>
            <math_ref ref="indexed identifier" />
        </math_or>
    </math_rule>

    <math_rule name="indexed identifier">
        <math_or>
            <math_node name="msub">
                <math_children match_type="starts_with">
                    <math_ref ref="identifier" />
                </math_children>
            </math_node>
            <math_node name="msup">
                <math_children match_type="starts_with">
                    <math_ref ref="identifier" />
                </math_children>
            </math_node>
            <math_node name="msubsup">
                <math_children match_type="starts_with">
                    <math_ref ref="identifier" />
                </math_children>
            </math_node>
        </math_or>
    </math_rule>

For consistency, every pattern starts with a prefix, denoting what it matches. The only exception is the phrase pattern. It obviously matches sequences of words. Here is another example pattern that illustrates how the phrase pattern can be used and how patterns of different types can be combined:

    <phrase tag="NP">   <!-- noun phrase -->
        <match_type>shortest</match_type>
        <starts_with_seq containment="lessorequal">
            <seq_seq>
                <seq_word><word_ref ref="indefinite article"/></seq_word>
            </seq_seq>
        </starts_with_seq>
        <ends_with_seq>
            <seq_word>
                <word_math>
                    <math_or>
                        <math_ref ref="identifier" />
                        <math_ref ref="identifier sequence" />
                    </math_or>
                </word_math>
            </seq_word>
        </ends_with_seq>
    </phrase>

Markers

Now we can use these rules to find e.g. declarations in a document. However, we'd also be interested in identifying the components of this declaration (introduced identifier, restrictions, ...). For this purpose, we can add markers to our patterns. Here is a rule that matches and marks simple formulas that introduce and restrict identifiers like in $a \in M$ or $x \ge 0$:

    <math_rule name="single identifier restricted">
        <math_marker name="restriction" tags="math,introducing_identifier">
            <math_node name="mrow">
                <math_children match_type="exact">
                    <math_marker name="identifier">
                        <math_ref ref="identifier"/>
                    </math_marker>
                    <math_node name="mo">
                        <mtext_ref ref="relation" />
                    </math_node>
                    <math_or>
                        <math_node name="mrow" />
                        <math_ref ref="identifier" />
                    </math_or>
                </math_children>
            </math_node>
        </math_marker>
    </math_rule>

A marker has a name and optionally a list of tags associated with it. Markers can also be added to words and sequences of words. However, they are processed differently internally, as they correspond to ranges in the DNM, while math markers correspond to nodes in the DOM.

Currently, the only way to use the rules is by calling a match_sentence function, which takes a sentence and a seq_rule name and returns a list of all matches in that sentence. A match is contains the matched markers as a tree structure.

Insights From The Example Declaration Spotter

Using this pattern file, I created a small example spotter to test the pattern matching library. As KAT doesn't support string offsets yet, I simply exported the results into an HTML file (attached as ZIP, because github didn't let me attach html). For simplicity, I ignored the tree structure of the resulting matches.

Insights:

dginev commented 6 years ago

I will try to rebase and merge this in, so that we have it integrated before it diverges further. It's a shame it has sat around as a PR for this long as it is, definitely belongs in master.

Will ltest around as well.

dginev commented 6 years ago

As I mentioned in #11 , this PR is now moved to a new branch that is local to the repository origin - apparently working with two masters (one origin, one on Frederik's fork) is too confusing for me. Named branches really make this simpler!