MartinThoma / hwrt

A toolset for handwriting recognition

Building a language model #37

Open MartinThoma opened 8 years ago

MartinThoma commented 8 years ago

The language model gives the probability of an expression occurring in a mathematical text / paper.

Examples

I'm not too sure which values the following expression should get:

Traditional language models introduce special symbols <s> and </s> for the start and the end of a sentence. Then they use trigram models.

Example

The sentence

the quick red fox jumps over the dog

has the probability

P((<s>, the, quick)) *
P((the, quick, red)) *
P((quick, red, fox)) *
P((red, fox, jumps)) *
P((fox, jumps, over)) *
P((jumps, over, the)) *
P((over, the, dog)) *
P((the, dog, </s>))

As you can see, longer expressions tend to be less likely than shorter ones, since every additional word contributes one more factor ≤ 1 to the product.

Now, how do you get P((<s>, the, quick))? Simply take a large amount of text (e.g. Wikipedia articles + revisions) and count the 3-grams. Then:

P((<s>, the, quick)) = N((<s>, the, quick)) / (total number of trigrams with "the" in the middle)
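A minimal sketch of this counting estimate in Python; the tiny corpus here is just a stand-in (real data would be e.g. Wikipedia dumps):

```python
from collections import Counter

def trigrams(tokens):
    """Yield all trigrams of a sentence, padded with <s> and </s>."""
    padded = ["<s>"] + tokens + ["</s>"]
    for i in range(len(padded) - 2):
        yield tuple(padded[i:i + 3])

# Stand-in corpus; in practice this would be a large text collection.
corpus = [
    "the quick red fox jumps over the dog".split(),
    "the quick dog sleeps".split(),
]

trigram_counts = Counter(t for sentence in corpus for t in trigrams(sentence))

# Total number of trigrams with a given word in the middle (the denominator above).
middle_counts = Counter()
for t, n in trigram_counts.items():
    middle_counts[t[1]] += n

def p_trigram(a, b, c):
    """P((a, b, c)) = N((a, b, c)) / N(trigrams with b in the middle)."""
    if middle_counts[b] == 0:
        return 0.0
    return trigram_counts[(a, b, c)] / middle_counts[b]

print(p_trigram("<s>", "the", "quick"))  # 2/3 with this toy corpus
```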

Smoothing (unknown word combinations)

There might be combinations you have never seen before. They might be impossible in principle (a structural zero), or possible but simply absent from your data (a contingent zero). For this reason, you should assign them a low probability, but not 0:

P(a, b, c) := P(b | word a before, word c after)
            = (Count((a, b, c)) + k) / (Count((a, *, c)) + k * |V|)

where V is the vocabulary; the k * |V| term appears because k is added once for each of the |V| possible choices of b.

This is called "Laplace smoothing" (for k = 1) or, more generally, "add-k estimation".
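Continuing the counting sketch above, add-k estimation could look like this (the value of k is a hypothetical choice, not something fixed in this issue):

```python
K = 0.1  # smoothing constant; a hypothetical choice

# Vocabulary of the toy corpus, plus the boundary symbols.
vocabulary = {w for sentence in corpus for w in sentence} | {"<s>", "</s>"}

# Count((a, *, c)): trigrams with a directly before and c directly after.
outer_counts = Counter()
for (a, b, c), n in trigram_counts.items():
    outer_counts[(a, c)] += n

def p_add_k(a, b, c, k=K):
    """P(b | word a before, word c after) with add-k smoothing."""
    return (trigram_counts[(a, b, c)] + k) / (outer_counts[(a, c)] + k * len(vocabulary))

print(p_add_k("<s>", "the", "quick"))  # seen combination
print(p_add_k("<s>", "a", "quick"))    # unseen combination, but > 0
```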

The variant with a unigram prior is

P(b | word a before, word c after) = (Count((a, b, c)) + m * P(b)) / (Count((a, *, c)) + m)

where P(b) is the unigram probability of b and m controls the strength of the prior.
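And the unigram-prior variant, again as a sketch on top of the counts above (m is a hypothetical tuning parameter):

```python
M = 1.0  # prior strength; a hypothetical choice

unigram_counts = Counter(w for sentence in corpus for w in sentence)
total_words = sum(unigram_counts.values())

def p_unigram(b):
    """Unigram probability P(b) over the toy corpus."""
    return unigram_counts[b] / total_words

def p_prior(a, b, c, m=M):
    """P(b | word a before, word c after), smoothed towards the unigram prior."""
    return (trigram_counts[(a, b, c)] + m * p_unigram(b)) / (outer_counts[(a, c)] + m)
```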

Other ideas:

In the English language, you can simply split sentences at ".". But for mathematics, tokenization is not that simple. For example, what are the relevant tokens in \frac{1+2}{3}?

I would say something like (<s>), (fraction numerator), (1), (+), (2), (fraction denominator), (3), (</s>) would probably be OK. However, mathematics is - in contrast to natural language - very strongly structured, and I would like to exploit that. So instead of a simple 3-gram model, I would build a 3-gram model with context (a tokenizer sketch follows the examples below):

P("\frac{1+2}{3}") = \product_{symbol in expression} P((symbol, context, directly left, directly right))
                  = P(("<s>", "<numerator>", "1", "+", "2", "</numerator><denominator>", "3", "</denominator>", "</s>"))
                  = P(("<numerator>", "", "<s>", "1")) *
                    P(("1", "numerator", "<numerator>", "+")) *
                    P(("+", "numerator", "1", "2")) *
                    P(("2", "numerator", "+", "</numerator><denominator>")) *
                    P(("</numerator><denominator>", "numerator", "2") *
                    P(("3", "denominator", "</numerator><denominator>", "</denominator>"))
                    P(("</denominator>", "denominator", "3", "</s>"))

with

P((symbol, context, directly left, directly right)) = Count((symbol, context, directly left, directly right)) / Count((*, context, directly left, directly right))
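A sketch of this estimate; the quadruple and context counts are assumed to come from an already-tokenized corpus:

```python
def p_context(symbol, context, left, right, quad_counts, ctx_counts):
    """P((symbol, context, left, right)) as defined above.

    quad_counts: counts of (symbol, context, left, right) quadruples.
    ctx_counts:  counts of (context, left, right) triples, i.e. the
                 Count((*, context, left, right)) denominators.
    """
    total = ctx_counts.get((context, left, right), 0)
    if total == 0:
        return 0.0  # unseen; in practice this needs smoothing as above
    return quad_counts.get((symbol, context, left, right), 0) / total
```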

Another example would be

P("(2+") = P("(", "", "<s>", "2") *
           P("2", "left round bracket", "(", "+") *
           P("+", "left round bracket", "2", "</s>")

Contexts

Ideas:

Future:

Split the data into a training and a test set. As an evaluation metric, I can use perplexity (intrinsic evaluation) or "extrinsic evaluation" (in vivo).
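Perplexity could be computed roughly like this, assuming a smoothed model such as p_add_k from the sketch above (so that no factor is zero):

```python
import math

def perplexity(test_sentences, prob):
    """Per-token perplexity; prob(a, b, c) must be a smoothed probability > 0."""
    log_prob = 0.0
    n = 0
    for tokens in test_sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - 2):
            log_prob += math.log(prob(*padded[i:i + 3]))
            n += 1
    return math.exp(-log_prob / n)

test_set = ["the red fox sleeps".split()]
print(perplexity(test_set, p_add_k))  # lower is better
```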

MartinThoma commented 8 years ago

Open questions

MartinThoma commented 8 years ago

See also: https://github.com/MartinThoma/write-math/issues/23

MartinThoma commented 8 years ago

Problems with parsing (La)TeX from arXiv