We should map all numbers to a fixed token, as is standard in the literature, e.g. NUM.
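A minimal sketch of this mapping, assuming a regex over standalone integers and decimals is sufficient (the exact number patterns to cover, e.g. scientific notation or thousands separators, is an open choice):

```python
import re

# Matches standalone integers and decimals like "10", "0.001", "1,000".
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def normalize_numbers(sentence: str, token: str = "NUM") -> str:
    """Replace every standalone number with the fixed placeholder token."""
    return NUMBER_RE.sub(token, sentence)
```

For example, `normalize_numbers("trained for 10 epochs with lr 0.001")` yields `"trained for NUM epochs with lr NUM"`.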
Sometimes word segmentation fails for normalized elements (such as mathformula and citationelement): stray characters get glued onto the placeholder. We should do the more expensive substring check and collapse such tokens to the canonical replacement, e.g. emathformulao -> mathformula.
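The substring check could look like the following sketch; the placeholder names come from the notes above, and the per-token collapse function is a hypothetical helper:

```python
# Canonical placeholder tokens produced by the normalization step.
PLACEHOLDERS = ("mathformula", "citationelement")

def collapse_placeholder(token: str) -> str:
    """If a placeholder occurs anywhere inside the token (segmentation glued
    stray characters onto it), collapse the whole token to the canonical form."""
    for canon in PLACEHOLDERS:
        if canon in token:
            return canon
    return token
```

Applied token-by-token, this turns "emathformulao" into "mathformula" while leaving ordinary words untouched; the substring scan is more expensive than an exact-match lookup, hence the caveat in the note.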
Drop sentences containing words longer than 30 characters. latexml conversions sometimes mis-interpret math mode and eat the whitespace of real sentences, producing incorrect, huge words. From what I have seen, the entire sentence is corrupted in these cases and only pollutes the model.
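A simple filter for this, assuming whitespace tokenization and treating the 30-character cap as a tunable constant:

```python
MAX_WORD_LEN = 30  # threshold from the note above

def keep_sentence(sentence: str) -> bool:
    """Reject the whole sentence if any whitespace-delimited token exceeds
    the length cap, since latexml whitespace-eating corruption tends to
    affect the entire sentence."""
    return all(len(word) <= MAX_WORD_LEN for word in sentence.split())
```

Sentences are discarded wholesale rather than having the long tokens stripped, matching the observation that the surrounding text is corrupted too.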