KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Improve token model normalization #13

Closed dginev closed 6 years ago

dginev commented 6 years ago
  1. We should map all numbers to a single fixed token, as is common in the literature, e.g. NUM.

  2. Word segmentation sometimes fails around normalized elements (such as mathformula and citationelement), leaving stray characters attached to them. We should do the more expensive substring check and collapse such words to the canonical replacement, e.g. emathformulao -> mathformula.

  3. Drop sentences that contain words longer than 30 characters. LaTeXML conversions sometimes misinterpret math mode and eat the whitespace of real sentences, producing huge bogus words. From what I have seen, the entire sentence is corrupted in those cases and only pollutes the model.
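
The three rules above could be sketched roughly as follows. This is a minimal, standalone illustration using only the standard library, not llamapun's actual implementation: the function names (`is_number`, `normalize_word`, `normalize_sentence`) are hypothetical, the definition of a "number" (digits with optional `.` or `,`) is an assumption, and so is the choice to apply the length check to the raw word before any collapsing.

```rust
/// Fixed token for numbers, as named in the issue.
const NUM_TOKEN: &str = "NUM";
/// Canonical replacements named in the issue.
const CANONICAL: [&str; 2] = ["mathformula", "citationelement"];
/// Sentences containing a longer word are dropped (rule 3).
const MAX_WORD_CHARS: usize = 30;

/// Assumption: a "number" is a word made of ASCII digits, '.' and ',',
/// with at least one digit (e.g. "42", "3.14", "1,000").
fn is_number(word: &str) -> bool {
    let mut seen_digit = false;
    for c in word.chars() {
        if c.is_ascii_digit() {
            seen_digit = true;
        } else if c != '.' && c != ',' {
            return false;
        }
    }
    seen_digit
}

/// Rules 1 and 2: collapse words that contain a canonical replacement as a
/// substring, and map numbers to the fixed NUM token.
fn normalize_word(word: &str) -> String {
    for canonical in CANONICAL.iter() {
        // The "more expensive" substring check from rule 2.
        if word.contains(*canonical) {
            return (*canonical).to_string();
        }
    }
    if is_number(word) {
        return NUM_TOKEN.to_string();
    }
    word.to_string()
}

/// Rule 3 plus per-word normalization. Returns None when the sentence
/// should be dropped because one of its raw words exceeds MAX_WORD_CHARS.
fn normalize_sentence(sentence: &str) -> Option<String> {
    let mut normalized = Vec::new();
    for word in sentence.split_whitespace() {
        if word.chars().count() > MAX_WORD_CHARS {
            return None;
        }
        normalized.push(normalize_word(word));
    }
    Some(normalized.join(" "))
}
```

Under these assumptions, `normalize_sentence("we have 42 emathformulao terms")` yields `Some("we have NUM mathformula terms")`, while any sentence containing a word of 31 or more characters yields `None`. Whether the length limit should be checked before or after the substring collapse is a design choice the issue leaves open.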