Closed jfschaefer closed 6 years ago
I will try to rebase and merge this in, so that we have it integrated before it diverges further. It's a shame it has sat around as a PR for this long; it definitely belongs in master.
Will test it as well.
As I mentioned in #11, this PR has now been moved to a new branch that is local to the origin repository. Apparently, working with two masters (one on origin, one on Frederik's fork) is too confusing for me. Named branches really make this simpler!
Summary
I implemented a new pattern matching library, based on the insights from my bachelor thesis. It can be used to match phrases, words and math formulae. At the moment, the patterns are written as an XML file. To test the new library, I implemented a new declaration spotter.
The PR also adds missing support for XML namespaces to the serialization code.
The Pattern Language
The patterns are written in an XML file (as in this example). A pattern file essentially contains a list of rules, which can reference each other. I will try to provide an overview of what these rules look like. One day I might write proper documentation.
Here is an example rule:
This creates a rule for matching words. It has a name so that we can reference it later. The meta node is optional and currently does not support much metadata. Afterwards, we have the actual pattern that is matched by this rule. In this case, it is a word_or pattern, which matches a word if any of the contained word patterns matches.
Here is a second word rule, referencing this rule:
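To illustrate only the semantics described above, here is a minimal Python sketch of a word_or pattern and of one rule referencing another by name. It is not the library's XML pattern syntax: the rule table `RULES`, the helpers `word` and `word_ref`, and the rule names are all hypothetical and chosen purely for illustration.

```python
from typing import Callable, Dict

# A word pattern is modeled as a predicate on a single word
# (hypothetical representation, not the library's internal one).
WordPattern = Callable[[str], bool]

# Hypothetical rule table: rule name -> word pattern.
RULES: Dict[str, WordPattern] = {}


def word(literal: str) -> WordPattern:
    """Match exactly one literal word, ignoring case."""
    return lambda w: w.lower() == literal.lower()


def word_or(*patterns: WordPattern) -> WordPattern:
    """Match a word if any of the contained word patterns matches."""
    return lambda w: any(p(w) for p in patterns)


def word_ref(rule_name: str) -> WordPattern:
    """Reference another rule by name; the lookup happens at match time."""
    return lambda w: RULES[rule_name](w)


# First rule: a named word rule built from word_or.
RULES["indefinite_article"] = word_or(word("a"), word("an"))

# Second rule: references the first rule and adds one more alternative.
RULES["article"] = word_or(word_ref("indefinite_article"), word("the"))
```

Resolving the reference at match time rather than at definition time is a design choice of this sketch; it allows rules to be defined in any order.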
There exist the following types of rules:
- mtext_rule for matching the symbols in math nodes
- math_rule for matching math nodes (or parts of them)
- pos_rule for matching part-of-speech (POS) tags
- word_rule for matching words
- seq_rule for matching sequences of words
Here is a more advanced example of two math rules that match an identifier using mutual recursion:
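As a rough illustration of the mutual recursion idea, and not of the actual XML rules, the following Python sketch uses two functions that call each other to recognize identifiers in a small MathML tree. It assumes, purely for the example, that an identifier is either a plain `mi` element or an `msub` element whose base is again an identifier; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_identifier(node: ET.Element) -> bool:
    """Rule 1: an identifier is a plain <mi>, or a subscripted identifier."""
    if node.tag == "mi":
        return True
    return is_subscripted_identifier(node)


def is_subscripted_identifier(node: ET.Element) -> bool:
    """Rule 2: an <msub> with two children whose base is itself an identifier."""
    if node.tag != "msub" or len(node) != 2:
        return False
    # The base is the first child; this call closes the recursion loop.
    return is_identifier(node[0])


# Small example inputs (MathML without namespaces, for brevity).
simple = ET.fromstring("<mi>x</mi>")
nested = ET.fromstring(
    "<msub><msub><mi>x</mi><mn>1</mn></msub><mi>k</mi></msub>"
)
number = ET.fromstring("<mn>3</mn>")
```

With these inputs, `is_identifier` accepts `simple` and `nested` and rejects `number`, because the recursion only bottoms out at an `mi` element.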
For consistency, every pattern starts with a prefix, denoting what it matches. The only exception is the phrase pattern. It obviously matches sequences of words. Here is another example pattern that illustrates how the phrase pattern can be used and how patterns of different types can be combined:
Markers
Now we can use these rules to find e.g. declarations in a document. However, we'd also be interested in identifying the components of this declaration (introduced identifier, restrictions, ...). For this purpose, we can add markers to our patterns. Here is a rule that matches and marks simple formulas that introduce and restrict identifiers like in $a \in M$ or $x \ge 0$:
A marker has a name and optionally a list of tags associated with it. Markers can also be added to words and sequences of words. However, they are processed differently internally, as they correspond to ranges in the DNM, while math markers correspond to nodes in the DOM.
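The distinction between the two kinds of markers can be sketched as a small data model. This is an assumption-laden Python illustration, not the library's actual types: the class names `Marker`, `TextMarker` and `MathMarker`, the end-exclusive character span, and the example marker names and tags are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import xml.etree.ElementTree as ET


@dataclass
class Marker:
    """A marker has a name and optionally a list of tags."""
    name: str
    tags: List[str] = field(default_factory=list)


@dataclass
class TextMarker:
    """Marks a word or word sequence as a range in the plain text."""
    marker: Marker
    span: Tuple[int, int]  # (start, end) offsets, end exclusive (assumed)


@dataclass
class MathMarker:
    """Marks a part of a formula as a node in the document tree."""
    marker: Marker
    node: ET.Element


# A formula like "a in M", written as simplified MathML without namespaces.
math = ET.fromstring("<mrow><mi>a</mi><mo>&#x2208;</mo><mi>M</mi></mrow>")
introduced = MathMarker(Marker("identifier", ["introduced"]), math[0])

# A word sequence marked by its character range in the text.
text = "for all positive integers"
start = text.index("positive integers")
restriction = TextMarker(Marker("restriction"), (start, len(text)))
```

Here the math marker holds a reference to the `mi` node itself, while the text marker only stores offsets, which mirrors the node versus range distinction described above.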
Currently, the only way to use the rules is by calling a match_sentence function, which takes a sentence and a seq_rule name and returns a list of all matches in that sentence. A match contains the matched markers as a tree structure.
Insights From The Example Declaration Spotter
Using this pattern file, I created a small example spotter to test the pattern matching library. As KAT doesn't support string offsets yet, I simply exported the results into an HTML file (attached as a ZIP, because GitHub didn't let me attach HTML). For simplicity, I ignored the tree structure of the resulting matches.
Insights: