Any expression which is longer tends to be less likely than a shorter one: each additional trigram contributes another probability factor of at most 1.
Now, how do you get P((<s>, the, quick))? Simply take a large amount of text (e.g. Wikipedia articles + revisions) and count those 3-grams. Then:
P((<s>, the, quick)) = N((<s>, the, quick)) / (total number of trigrams with "the" in the middle)
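A minimal sketch of this counting scheme (the toy corpus and the whitespace tokenization are placeholder assumptions):

```python
from collections import Counter

def count_trigrams(sentences):
    """Count all trigrams, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts

def trigram_probability(counts, trigram):
    """N(trigram) / (total number of trigrams with the same middle word)."""
    middle = trigram[1]
    total = sum(n for (_, b, _), n in counts.items() if b == middle)
    return counts[trigram] / total if total else 0.0

corpus = ["the quick brown fox", "the quick dog"]
counts = count_trigrams(corpus)
print(trigram_probability(counts, ("<s>", "the", "quick")))
# → 1.0, since both toy sentences start with "the quick"
```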
Smoothing (unknown word combinations)
There might be combinations you haven't seen before. They might simply be wrong (a structural zero), or they might just not occur in your data (a contingent zero). For this reason, you should assign them a low, but nonzero, probability:
P(a,b,c) := P(b | word a before, word c after)
= (Count((a, b, c)) + k) / (Count((a, *, c)) + k*|V|)
where V is the vocabulary and Count((a, *, c)) is the number of trigrams with a before and c after (summing k over the |V| possible middle words gives the k*|V| in the denominator)
This is called "Laplace Smoothing" or "add-k estimation".
The unigram prior is
P(b | word a before, word c after) = (Count((a, b, c)) + m*P(b)) / (Count((a, *, c)) + m)
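Both estimators can be sketched directly from the formulas above (the counts and the k, m values are made up for illustration):

```python
from collections import Counter

def add_k_probability(counts, a, b, c, vocab_size, k=0.1):
    """Add-k estimate of P(b | word a before, word c after)."""
    context_total = sum(n for (x, _, z), n in counts.items() if x == a and z == c)
    return (counts[(a, b, c)] + k) / (context_total + k * vocab_size)

def prior_probability(counts, unigrams, a, b, c, m=1.0):
    """Same estimate, but with a unigram prior P(b) instead of a flat k."""
    p_b = unigrams[b] / sum(unigrams.values())
    context_total = sum(n for (x, _, z), n in counts.items() if x == a and z == c)
    return (counts[(a, b, c)] + m * p_b) / (context_total + m)

counts = Counter({("4", "+", "5"): 3, ("4", "-", "5"): 1})
unigrams = Counter({"4": 4, "5": 4, "+": 3, "-": 1})
# An unseen middle word still gets a small, nonzero probability:
p_unseen = add_k_probability(counts, "4", "\\cdot", "5", vocab_size=4)
```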
Other ideas:
Backoff
Good Turing Smoothing
Kneser-Ney Smoothing
Adjusting language models for mathematics
In the English language, you can simply split sentences at ".". But for mathematics, tokenization is not that simple. For example, which are the relevant tokens in \frac{1+2}{3}?
I would say something like (<s>) (fraction numerator) (1) (+) (2) (fraction denominator) (3) (</s>) would probably be ok. However, mathematics is, in contrast to natural language, very strongly structured, and I would like to exploit that. So instead of building a plain 3-gram model, I would build a 3-gram model with context:
\frac{numerator}{denominator}: triggers numerator and denominator
Brackets / bracket-like:
( triggers "left round bracket", ) closes it
[ triggers "left square bracket", ] closes it
{ triggers "left curly bracket", } closes it
\lfloor triggers "left floor", \rfloor closes it
\lceil triggers "left ceiling", \rceil closes it
\int triggers "integral", d and \mathrm{d} close it
^{superscript} triggers "superscript"
_{subscript} triggers "subscript"
\sqrt[n]{root} triggers "root" and "root exponent"
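A toy tokenizer for just the \frac case above (a hand-rolled sketch, not a general LaTeX parser; the token names follow the proposal above):

```python
import re

def tokenize_frac(latex):
    """Turn \\frac{...}{...} with single-character operands into the
    proposed context tokens. Toy sketch: no nesting, no other commands."""
    match = re.fullmatch(r"\\frac\{([^{}]*)\}\{([^{}]*)\}", latex)
    tokens = ["<s>", "fraction numerator"]
    tokens += list(match.group(1))          # e.g. "1+2" -> "1", "+", "2"
    tokens.append("fraction denominator")
    tokens += list(match.group(2))
    tokens.append("</s>")
    return tokens

print(tokenize_frac(r"\frac{1+2}{3}"))
# ['<s>', 'fraction numerator', '1', '+', '2', 'fraction denominator', '3', '</s>']
```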
Ideas:
\tilde, \bar, \hat, \dot, \ddot, \not
\lim, \sum, \prod, \exists, \forall (what closes it? Next LaTeX block, except if there are brackets?)
= triggers "left equals", "right equals"? -> Problem: what is the "last" context in "(2=x)"? The first one evaluated, hence "="? Might get difficult.
\underbrace / \overbrace?
Should P( trigger a "probability" context?
Future:
matrices: they should not be supported for now.
Normalizations
Some symbols, like \mathds{Z} and \mathbb{Z}, might be used synonymously, depending on the author's preference. They should probably be normalized to the same symbol.
Other symbols, like 1,2,3,4,5,6,7,8,9, are probably interchangeable. They should probably be normalized to NON-ZERO DIGIT. They are, of course, different, but that difference can easily be detected by the movement model (aka the acoustic model in ASR).
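A sketch of such a normalization pass (the synonym table and the <NZD> placeholder name are my own illustrative choices):

```python
import re

# Assumed synonym table; the entries are examples, not an exhaustive list.
SYNONYMS = {r"\mathds{Z}": r"\mathbb{Z}",
            r"\mathds{R}": r"\mathbb{R}"}

def normalize(latex):
    """Map synonymous commands to one canonical form and collapse
    the digits 1-9 into a single NON-ZERO-DIGIT class token."""
    for variant, canonical in SYNONYMS.items():
        latex = latex.replace(variant, canonical)
    return re.sub(r"[1-9]", "<NZD>", latex)

print(normalize(r"\mathds{Z} \ni 42"))
# → \mathbb{Z} \ni <NZD><NZD>
```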
Data sources
Wikipedia (<math>...</math>)
arXiv ($...$ and \[ ... \]; for now, I only want single-line formulas. That's complicated enough.)
Evaluation
Split the data into a training set and a test set. As an evaluation metric, I can use perplexity (intrinsic) or "extrinsic evaluation" (in-vivo).
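Perplexity is straightforward to compute from per-token probabilities; a minimal sketch:

```python
import math

def perplexity(token_probs):
    """exp(-1/N * sum(log p_i)) over the N tokens; lower is better."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# A model that assigns 1/4 to each of four tokens has perplexity 4,
# i.e. it is as "confused" as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```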
The language model gives the probability of an expression occurring in a mathematical text / paper.
Examples
4 +
should be low, as + is a binary operator.
4 + \cdot
should be very low, as + is an operator and \cdot is another operator (and \cdot is a binary operator).
\int
should be low, as it is usually followed by something.
\int dx
should be low, as there is usually something in between.
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
should be very high, as this is Bayes' theorem.
([adfasdf)
should be low, as this is a wrong stacking of brackets.
(4+5))
should be low, as this closes a bracket which wasn't opened.
I'm not too sure which value the following expression should get:
1+2: it is valid, but probably not occurring in any paper on arXiv.
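The two bracket examples above are structural zeros that can be detected deterministically; a minimal stack-based check:

```python
def brackets_balanced(expression):
    """Check bracket stacking with a stack; returns False for wrong
    nesting like '([adfasdf)' or an unopened ')' as in '(4+5))'."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for char in expression:
        if char in "([{":
            stack.append(char)
        elif char in pairs:
            if not stack or stack.pop() != pairs[char]:
                return False
    return not stack  # leftover openers also mean imbalance

print(brackets_balanced("([adfasdf)"))  # False: ']' expected before ')'
print(brackets_balanced("(4+5))"))      # False: second ')' was never opened
print(brackets_balanced("(4+5)"))       # True
```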
Traditional language models
Traditional language models introduce special symbols <s> and </s> for the start and the end of a sentence. Then they use trigram models: the probability of a sentence is the product of the probabilities of its trigrams.