Generated via this PR:
AMS paragraph model finished in 87,387 s, gathering 269,746 documents and 29,425,347 paragraphs (52,790 paragraphs discarded due to overly long words); 114 GB unpacked.
Paragraph classification (normalized to 23 classes):
other: 19,684,711
proof: 4,213,438
theorem: 1,239,271
lemma: 1,166,141
proposition: 861,704
definition: 675,027
remark: 644,450
corollary: 387,307
example: 342,996
notation: 51,086
conjecture: 40,019
problem: 27,483
assumption: 23,074
fact: 17,654
algorithm: 13,752
step: 10,850
question: 7,101
case: 4,276
acknowledgement: 4,216
paragraph: 3,403
condition: 3,355
result: 2,664
caption: 1,369
---
total: 29,425,347
This is still undergoing testing while I prepare an actual bi-LSTM experiment, but once I do some basic quality control it should be good to merge.
The main idea here is to use our llamapun paragraph iterators to find documents with `ltx_theorem` markup and extract a labeled data set of `(paragraph plain text, AMS class)` pairs. For now I am using the same preprocessing and normalization I used for the word embeddings, and I have manually surveyed the over 20,000 author-specified `\newtheorem{class}` values, reducing them to 23 sensible classes for a quick first test.
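To make the normalization step concrete, here is a minimal, self-contained Rust sketch, not the PR's actual code: it collapses a handful of hypothetical author-specified `\newtheorem` environment names onto the canonical class labels listed above, with anything unrecognized falling back to `other`. The alias entries and function names are illustrative only; the real survey in the PR covers the full 20,000+ values, and the paragraph text itself comes from the llamapun paragraph iterators over `ltx_theorem`-marked documents.

```rust
use std::collections::HashMap;

/// Illustrative alias table (hypothetical entries): collapse raw `\newtheorem`
/// environment names onto the canonical class labels.
fn normalization_map() -> HashMap<&'static str, &'static str> {
    let mut map = HashMap::new();
    for (alias, class) in [
        ("thm", "theorem"),
        ("theorem", "theorem"),
        ("mainthm", "theorem"),
        ("lem", "lemma"),
        ("lemma", "lemma"),
        ("defn", "definition"),
        ("definition", "definition"),
        ("rem", "remark"),
        ("remark", "remark"),
        ("cor", "corollary"),
        ("corollary", "corollary"),
        ("conj", "conjecture"),
    ]
    .iter()
    {
        map.insert(*alias, *class);
    }
    map
}

/// Normalize a single raw environment name; anything outside the surveyed
/// aliases falls into the catch-all `other` class.
fn normalize_class(map: &HashMap<&'static str, &'static str>, raw: &str) -> &'static str {
    map.get(raw.trim().to_lowercase().as_str())
        .copied()
        .unwrap_or("other")
}

fn main() {
    let map = normalization_map();
    assert_eq!(normalize_class(&map, "Thm"), "theorem");
    assert_eq!(normalize_class(&map, "my_exotic_env"), "other");
    println!("normalization sketch ok");
}
```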