KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

AMSthm paragraph models #16

Closed dginev closed 6 years ago

dginev commented 6 years ago

This is still undergoing testing while I prepare an actual bi-LSTM experiment, but once I have done some basic quality control, it should be good to merge.

The main idea here is to use our llamapun paragraph iterators to find documents with ltx_theorem markup and extract a labeled data set with (paragraph plain text, AMS class) pairs.
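As a rough illustration of the labeling step, the sketch below builds one (paragraph plain text, AMS class) pair from the `class` attribute of the enclosing element. It does not use the llamapun iterators themselves. The names `LabeledParagraph`, `ams_class_from_attr` and `label_paragraph` are hypothetical, and it assumes that theorem-like blocks carry a token of the form `ltx_theorem_<name>` next to the generic `ltx_theorem`, and that paragraphs outside such markup are labeled `other`.

```rust
/// One labeled training example: a paragraph's plain text plus its AMS class.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct LabeledParagraph {
    pub text: String,
    pub ams_class: String,
}

/// Derive the author-specified class name from an HTML `class` attribute value.
/// Assumption: the specific class appears as a token `ltx_theorem_<name>`.
pub fn ams_class_from_attr(class_attr: &str) -> Option<String> {
    class_attr
        .split_whitespace()
        // The bare `ltx_theorem` token has no trailing underscore, so it is skipped here.
        .filter_map(|token| token.strip_prefix("ltx_theorem_"))
        .find(|name| !name.is_empty())
        .map(|name| name.to_lowercase())
}

/// Pair a paragraph with a label. Paragraphs with no enclosing theorem markup
/// fall back to "other" (an assumption about how that class is populated).
pub fn label_paragraph(text: &str, enclosing_class_attr: Option<&str>) -> LabeledParagraph {
    let ams_class = enclosing_class_attr
        .and_then(ams_class_from_attr)
        .unwrap_or_else(|| "other".to_string());
    LabeledParagraph {
        text: text.trim().to_string(),
        ams_class,
    }
}
```

Calling `label_paragraph("Let x be prime.", Some("ltx_theorem ltx_theorem_lemma"))` would yield a pair labeled `lemma`. The raw name still needs normalization before it can serve as a training label.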

For now I am using the same preprocessing and normalization I used for the word embeddings, and I have manually surveyed the more than 20,000 author-specified \newtheorem{class} values, reducing them to 23 sensible ones for a quick first test.
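One way to picture the reduction of raw \newtheorem names to a small label set is a lookup over a cleaned-up key. The sketch below is a guess at the shape of such a mapping, not the surveyed table from this PR: the function name `normalize_ams_class` and every alias in it are hypothetical examples, and only a few of the 23 target classes are covered.

```rust
/// Map an author-specified environment name to a canonical class name.
/// The alias table is illustrative and incomplete; unknown names return `None`
/// so that the caller can decide how to handle them.
pub fn normalize_ams_class(raw: &str) -> Option<&'static str> {
    // Keep ASCII letters only, lowercased, so "Thm.", "thm*" and "THM" collapse together.
    let key: String = raw
        .chars()
        .filter(|c| c.is_ascii_alphabetic())
        .map(|c| c.to_ascii_lowercase())
        .collect();

    let canonical = match key.as_str() {
        "theorem" | "thm" | "theo" => "theorem",
        "lemma" | "lem" | "lemm" => "lemma",
        "proposition" | "prop" => "proposition",
        "corollary" | "cor" | "coro" => "corollary",
        "definition" | "defn" | "def" | "defi" => "definition",
        "remark" | "rem" | "rmk" => "remark",
        "example" | "exam" | "exmp" => "example",
        "conjecture" | "conj" => "conjecture",
        "proof" => "proof",
        _ => return None,
    };
    Some(canonical)
}
```

A hand-written `match` keeps the mapping auditable, which suits a manually surveyed list. A table loaded from a file would scale better to thousands of raw values.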

dginev commented 6 years ago

Generated via this PR:

AMS paragraph model finished in 87387s, gathered:

269,746 documents
29,425,347 paragraphs
52,790 discarded paragraphs (long words)
114 GB unpacked

Paragraph classification (23 class normalization):

other: 19,684,711
proof: 4,213,438
theorem: 1,239,271
lemma: 1,166,141
proposition: 861,704
definition: 675,027
remark: 644,450
corollary: 387,307
example: 342,996
notation: 51,086
conjecture: 40,019
problem: 27,483
assumption: 23,074
fact: 17,654
algorithm: 13,752
step: 10,850
question: 7,101
case: 4,276
acknowledgement: 4,216
paragraph: 3,403
condition: 3,355
result: 2,664
caption: 1,369
---
total: 29,425,347
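
The counts above are heavily skewed toward `other` and `proof`, which is worth quantifying before training a classifier. The sketch below computes each class's share of the total from the reported numbers. The names `REPORTED_COUNTS` and `class_shares` are hypothetical, and the code is not part of the PR.

```rust
/// Paragraph counts per class, copied from the report above.
pub const REPORTED_COUNTS: [(&str, u64); 23] = [
    ("other", 19_684_711),
    ("proof", 4_213_438),
    ("theorem", 1_239_271),
    ("lemma", 1_166_141),
    ("proposition", 861_704),
    ("definition", 675_027),
    ("remark", 644_450),
    ("corollary", 387_307),
    ("example", 342_996),
    ("notation", 51_086),
    ("conjecture", 40_019),
    ("problem", 27_483),
    ("assumption", 23_074),
    ("fact", 17_654),
    ("algorithm", 13_752),
    ("step", 10_850),
    ("question", 7_101),
    ("case", 4_276),
    ("acknowledgement", 4_216),
    ("paragraph", 3_403),
    ("condition", 3_355),
    ("result", 2_664),
    ("caption", 1_369),
];

/// Fraction of all paragraphs in each class, in input order.
/// Returns an empty vector when the total is zero, to avoid dividing by zero.
pub fn class_shares(counts: &[(&str, u64)]) -> Vec<(String, f64)> {
    let total: u64 = counts.iter().map(|(_, n)| *n).sum();
    if total == 0 {
        return Vec::new();
    }
    counts
        .iter()
        .map(|(name, n)| (name.to_string(), *n as f64 / total as f64))
        .collect()
}
```

Calling `class_shares(&REPORTED_COUNTS)` shows that `other` alone accounts for roughly two thirds of all paragraphs, so class weighting or downsampling is probably worth considering for the bi-LSTM experiment.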