KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Canonicalization and hashing for the DOM, under the DNM umbrella #4

Closed dginev closed 8 years ago

dginev commented 8 years ago

Example

Motivation

The focus here is on the "structured" part of our semi-structured documents, i.e. the tree representation of their various fragments, be they formulas, tables, narrative sections, paragraphs, etc.

Concretely, there is a long-standing challenge in consistently performing cross-formula operations on our MathML trees, starting from the simplest equality comparisons and going deeper into compacting parse trees and performing advanced content-based substitutions. The task has two parts: canonical representations, and hashing into unique identifiers.

For the canonical representations, I implemented the simplest footprint that makes sense here: node names, class attributes, and text content, ignoring everything else in the markup. I also ignore annotation nodes for math and unwrap the semantics element, so that the canonical representation of a formula is currently the presentation MathML tree, which is the most reliable bit we have. The end goal is that while you can write a LaTeX equation in many equivalent ways, we want a canonical form that is identical for different LaTeX inputs, as long as the content is the same. This also covers stylistic concerns such as bold, italic, and colors, since LaTeXML represents all of those via the "class" attribute of the nodes.
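To make the footprint concrete, here is a minimal, self-contained sketch of the idea over a toy tree type. `Node` and `canonical_string` are illustrative names, not llamapun's actual API, and the serialization format shown is an assumption; the real implementation works on a libxml DOM.

```rust
// Hypothetical sketch: a toy tree plus the canonical footprint described
// above (node name, class attribute, text content; everything else ignored).
#[derive(Debug)]
enum Node {
    Element {
        name: String,
        class: Option<String>,
        children: Vec<Node>,
    },
    Text(String),
}

fn canonical_string(node: &Node) -> String {
    match node {
        Node::Text(t) => t.clone(),
        Node::Element { name, class, children } => {
            // Skip Content MathML annotations, so only the presentation
            // tree contributes to the canonical form.
            if name == "annotation" || name == "annotation-xml" {
                return String::new();
            }
            let inner: String = children.iter().map(canonical_string).collect();
            // Unwrap <semantics>: keep its (filtered) children only.
            if name == "semantics" {
                return inner;
            }
            match class {
                Some(c) => format!("{}[{}]({})", name, c, inner),
                None => format!("{}({})", name, inner),
            }
        }
    }
}

fn main() {
    // x^2 written as <msup><mi>x</mi><mn>2</mn></msup>
    let formula = Node::Element {
        name: "msup".into(),
        class: None,
        children: vec![
            Node::Element { name: "mi".into(), class: None, children: vec![Node::Text("x".into())] },
            Node::Element { name: "mn".into(), class: None, children: vec![Node::Text("2".into())] },
        ],
    };
    println!("{}", canonical_string(&formula));
    // msup(mi(x)mn(2))
}
```

The same canonical string comes out regardless of attribute order, comments, or whitespace in the original XML, which is exactly what the equality comparisons need.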

I won't go into the advanced concerns of comparing XML - there are many ways to write down an XML document with the same "content payload", so any naive diff will fail miserably. Because the format is so complex, a very focused canonicalization pass, which discards almost everything from the markup (namespaces, comment nodes, CDATA nodes, ...), is essential for doing reliable comparisons (and hashing) focused on content. I even went as far as sorting the inner values of the class attribute, just to be certain that comparisons are invariant to permutations of the values.
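The class-value sorting amounts to a one-liner; this is a sketch with an illustrative function name (not llamapun's API) and made-up LaTeXML-style class values:

```rust
// Normalize a class attribute: split on whitespace, sort, re-join, so that
// permuted values canonicalize to the same string.
fn normalize_class(class: &str) -> String {
    let mut values: Vec<&str> = class.split_whitespace().collect();
    values.sort_unstable();
    values.join(" ")
}

fn main() {
    // The two orderings collapse to one canonical attribute value.
    assert_eq!(
        normalize_class("ltx_red ltx_font_bold"),
        normalize_class("ltx_font_bold ltx_red")
    );
    println!("{}", normalize_class("ltx_red ltx_font_bold"));
    // ltx_font_bold ltx_red
}
```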

For the hashing, I use the simplest MD5 on the canonical string of a node. It's fast, and collisions are still highly unlikely on a dataset the size of arXiv. In a way this is the easy part; c14n is harder.
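The PR uses MD5; to keep this sketch dependency-free, the standard library's `DefaultHasher` stands in for it. The point is the same either way: hash the canonical string, not the raw XML, so equal formulas get equal keys.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical helper: hash a canonical string into a fixed-size key.
// (The actual PR hashes with MD5 rather than DefaultHasher.)
fn canonical_hash(canonical: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    canonical.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Two occurrences of the same formula collapse to one canonical
    // string first, and therefore one hash.
    let a = canonical_hash("msup(mi(x)mn(2))");
    let b = canonical_hash("msup(mi(x)mn(2))");
    assert_eq!(a, b);
    println!("{:016x}", a);
}
```

Note that `DefaultHasher` is only deterministic within a single process, which is another reason a stable digest like MD5 is the right choice for keys that persist across corpus runs.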

What we can now do, with this PR merged, is efficiently build a formula model from the arXiv dataset and get a grasp of the number of distinct expressions. I'm curious to see whether there are any MD5 collisions on arXiv, and to get a frequency model of math expressions.
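The frequency model then reduces to counting canonical strings. A minimal sketch, with an illustrative helper name and a toy three-formula corpus standing in for arXiv:

```rust
use std::collections::HashMap;

// Hypothetical helper: count each canonical string in a corpus and rank
// by frequency (descending), breaking ties alphabetically.
fn frequency_ranked(corpus: &[&str]) -> Vec<(String, u64)> {
    let mut freq: HashMap<&str, u64> = HashMap::new();
    for &canonical in corpus {
        *freq.entry(canonical).or_insert(0) += 1;
    }
    let mut ranked: Vec<(String, u64)> =
        freq.into_iter().map(|(c, n)| (c.to_string(), n)).collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    ranked
}

fn main() {
    let corpus = ["msup(mi(x)mn(2))", "mi(y)", "msup(mi(x)mn(2))"];
    for (canonical, count) in frequency_ranked(&corpus) {
        println!("{:>3}  {}", count, canonical);
    }
}
```

In practice one would key the map on the MD5 digest rather than the full canonical string to keep memory bounded at arXiv scale.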


@jfschaefer and @urabenstein, feel free to review and suggest changes. I'll leave the PR open for a couple of days.