Any expression which is longer tends to be less likely than a shorter one: each additional trigram contributes another probability factor of at most 1.
Now, how do you get P((<s>, the, quick))? Simply take a large amount of text (e.g. Wikipedia articles + revisions) and count those 3-grams. Then:
P((<s>, the, quick)) = N((<s>, the, quick)) / (total number of trigrams with "the" in the middle)
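A minimal sketch of this counting scheme (the toy corpus and the whitespace tokenization are placeholder assumptions):

```python
from collections import Counter

def count_trigrams(sentences):
    """Count all trigrams, padding each sentence with <s> and </s>."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts

def trigram_probability(counts, trigram):
    """N(trigram) / (total number of trigrams with the same middle word)."""
    middle = trigram[1]
    total = sum(n for (_, b, _), n in counts.items() if b == middle)
    return counts[trigram] / total if total else 0.0

corpus = ["the quick brown fox", "the quick dog"]
counts = count_trigrams(corpus)
print(trigram_probability(counts, ("<s>", "the", "quick")))
# → 1.0, since both toy sentences start with "the quick"
```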
Smoothing (unknown word combinations)
There might be combinations you haven't seen before. They might simply be wrong (a structural zero), or they might just not occur in your data (a contingent zero). For this reason, you should assign them a low, but nonzero, probability:
P(a,b,c) := P(b | word a before, word c after)
= (Count((a, b, c)) + k) / (Count((a, *, c)) + k*|V|)
where V is the vocabulary and Count((a, *, c)) is the number of trigrams with a before and c after (summing k over the |V| possible middle words gives the k*|V| in the denominator)
This is called "Laplace Smoothing" or "add-k estimation".
The unigram prior is
P(b | word a before, word c after) = (Count((a, b, c)) + m*P(b)) / (Count((a, *, c)) + m)
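Both estimators can be sketched directly from the formulas above (the counts and the k, m values are made up for illustration):

```python
from collections import Counter

def add_k_probability(counts, a, b, c, vocab_size, k=0.1):
    """Add-k estimate of P(b | word a before, word c after)."""
    context_total = sum(n for (x, _, z), n in counts.items() if x == a and z == c)
    return (counts[(a, b, c)] + k) / (context_total + k * vocab_size)

def prior_probability(counts, unigrams, a, b, c, m=1.0):
    """Same estimate, but with a unigram prior P(b) instead of a flat k."""
    p_b = unigrams[b] / sum(unigrams.values())
    context_total = sum(n for (x, _, z), n in counts.items() if x == a and z == c)
    return (counts[(a, b, c)] + m * p_b) / (context_total + m)

counts = Counter({("4", "+", "5"): 3, ("4", "-", "5"): 1})
unigrams = Counter({"4": 4, "5": 4, "+": 3, "-": 1})
# An unseen middle word still gets a small, nonzero probability:
p_unseen = add_k_probability(counts, "4", "\\cdot", "5", vocab_size=4)
```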
Other ideas:
Backoff
Good Turing Smoothing
Kneser-Ney Smoothing
Adjusting language models for mathematics
In the English language, you can simply split sentences at ".". But for mathematics, tokenization is not that simple. For example, which are the relevant tokens in \frac{1+2}{3}?
I would say something like (<s>) (fraction numerator) (1) (+) (2) (fraction denominator) (3) (</s>) would probably be ok. However, mathematics is, in contrast to natural language, very strongly structured, and I would like to exploit that. So instead of building a plain 3-gram model, I would build a 3-gram model with context:
\frac{numerator}{denominator}: triggers numerator and denominator
Brackets / bracket-like:
( triggers "left round bracket", ) closes it
[ triggers "left square bracket", ] closes it
{ triggers "left curly bracket", } closes it
\lfloor triggers "left floor", \rfloor closes it
\lceil triggers "left ceiling", \rceil closes it
\int triggers "integral", d and \mathrm{d} close it
^{superscript} triggers "superscript"
_{subscript} triggers "subscript"
\sqrt[n]{root} triggers "root" and "root exponent"
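A toy tokenizer for just the \frac case above (a hand-rolled sketch, not a general LaTeX parser; the token names follow the proposal above):

```python
import re

def tokenize_frac(latex):
    """Turn \\frac{...}{...} with single-character operands into the
    proposed context tokens. Toy sketch: no nesting, no other commands."""
    match = re.fullmatch(r"\\frac\{([^{}]*)\}\{([^{}]*)\}", latex)
    tokens = ["<s>", "fraction numerator"]
    tokens += list(match.group(1))          # e.g. "1+2" -> "1", "+", "2"
    tokens.append("fraction denominator")
    tokens += list(match.group(2))
    tokens.append("</s>")
    return tokens

print(tokenize_frac(r"\frac{1+2}{3}"))
# ['<s>', 'fraction numerator', '1', '+', '2', 'fraction denominator', '3', '</s>']
```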
Ideas:
\tilde, \bar, \hat, \dot, \ddot, \not
\lim, \sum, \prod, \exists, \forall (what closes it? Next LaTeX block, except if there are brackets?)
= triggers "left equals", "right equals"? -> Problem: what is the "last" context in "(2=x)"? The first one evaluated, hence "="? Might get difficult.
\underbrace / \overbrace?
Should P( trigger a "probability" context?
Future:
matrices: they should not be supported for now.
Normalizations
Some symbols, like \mathds{Z} and \mathbb{Z}, might be used synonymously, depending on the author's preference. They should probably be normalized to the same symbol.
Other symbols, like 1,2,3,4,5,6,7,8,9, are probably interchangeable. They should probably be normalized to NON-ZERO DIGIT. They are, of course, different, but that difference can easily be detected by the movement model (aka the acoustic model in ASR).
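A sketch of such a normalization pass (the synonym table and the <NZD> placeholder name are my own illustrative choices):

```python
import re

# Assumed synonym table; the entries are examples, not an exhaustive list.
SYNONYMS = {r"\mathds{Z}": r"\mathbb{Z}",
            r"\mathds{R}": r"\mathbb{R}"}

def normalize(latex):
    """Map synonymous commands to one canonical form and collapse
    the digits 1-9 into a single NON-ZERO-DIGIT class token."""
    for variant, canonical in SYNONYMS.items():
        latex = latex.replace(variant, canonical)
    return re.sub(r"[1-9]", "<NZD>", latex)

print(normalize(r"\mathds{Z} \ni 42"))
# → \mathbb{Z} \ni <NZD><NZD>
```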
Data sources
Wikipedia (<math>...</math>)
arXiv ($...$ and \[ ... \]; for now, I only want single-line formulas. That's complicated enough.)
Evaluation
Split the data into a training set and a test set. As an evaluation metric, I can use perplexity (intrinsic) or "extrinsic evaluation" (in-vivo).
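Perplexity is straightforward to compute from per-token probabilities; a minimal sketch:

```python
import math

def perplexity(token_probs):
    """exp(-1/N * sum(log p_i)) over the N tokens; lower is better."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# A model that assigns 1/4 to each of four tokens has perplexity 4,
# i.e. it is as "confused" as a uniform choice among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```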
The language model gives the probability of an expression occurring in a mathematical text / paper.
Examples
4 +
should be low, as + is a binary operator.
4 + \cdot
should be very low, as + is an operator and \cdot is another operator (and \cdot is a binary operator).
\int
should be low, as it is usually followed by something.
\int dx
should be low, as there is usually something in between.
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
should be very high, as this is Bayes' theorem.
([adfasdf)
should be low, as this is a wrong stacking of brackets.
(4+5))
should be low, as this closes a bracket which wasn't opened.
I'm not too sure which value the following expression should get:
1+2: it is valid, but probably not occurring in any paper on arXiv.
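The two bracket examples above are structural zeros that can be detected deterministically; a minimal stack-based check:

```python
def brackets_balanced(expression):
    """Check bracket stacking with a stack; returns False for wrong
    nesting like '([adfasdf)' or an unopened ')' as in '(4+5))'."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for char in expression:
        if char in "([{":
            stack.append(char)
        elif char in pairs:
            if not stack or stack.pop() != pairs[char]:
                return False
    return not stack  # leftover openers also mean imbalance

print(brackets_balanced("([adfasdf)"))  # False: ']' expected before ')'
print(brackets_balanced("(4+5))"))      # False: second ')' was never opened
print(brackets_balanced("(4+5)"))       # True
```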
Traditional language models
Traditional language models introduce special symbols <s> and </s> for the start and the end of a sentence. Then they use trigram models: the probability of a sentence is the product of the probabilities of its trigrams.