Generated via this PR:
AMS paragraph model finished in 87,387 s, gathering 269,746 documents and 29,425,347 paragraphs (52,790 paragraphs discarded due to overly long words); 114 GB unpacked.
Paragraph classification (normalized to 23 classes):
other: 19,684,711
proof: 4,213,438
theorem: 1,239,271
lemma: 1,166,141
proposition: 861,704
definition: 675,027
remark: 644,450
corollary: 387,307
example: 342,996
notation: 51,086
conjecture: 40,019
problem: 27,483
assumption: 23,074
fact: 17,654
algorithm: 13,752
step: 10,850
question: 7,101
case: 4,276
acknowledgement: 4,216
paragraph: 3,403
condition: 3,355
result: 2,664
caption: 1,369
---
total: 29,425,347
This is still undergoing testing while I prepare an actual bi-LSTM experiment, but once I do some basic quality control it should be good to merge.
The main idea here is to use our llamapun paragraph iterators to find documents with `ltx_theorem` markup and extract a labeled data set of `(paragraph plain text, AMS class)` pairs. For now I am using the same preprocessing and normalization I used for the word embeddings, and I have manually surveyed the over 20,000 author-specified `\newtheorem{class}` values, reducing them to 23 sensible classes for a quick first test.
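To make the normalization step concrete, here is a minimal, self-contained Rust sketch, not the PR's actual code: it collapses a handful of hypothetical author-specified `\newtheorem` environment names onto the canonical class labels listed above, with anything unrecognized falling back to `other`. The alias entries and function names are illustrative only; the real survey in the PR covers the full 20,000+ values, and the paragraph text itself comes from the llamapun paragraph iterators over `ltx_theorem`-marked documents.

```rust
use std::collections::HashMap;

/// Illustrative alias table (hypothetical entries): collapse raw `\newtheorem`
/// environment names onto the canonical class labels.
fn normalization_map() -> HashMap<&'static str, &'static str> {
    let mut map = HashMap::new();
    for (alias, class) in [
        ("thm", "theorem"),
        ("theorem", "theorem"),
        ("mainthm", "theorem"),
        ("lem", "lemma"),
        ("lemma", "lemma"),
        ("defn", "definition"),
        ("definition", "definition"),
        ("rem", "remark"),
        ("remark", "remark"),
        ("cor", "corollary"),
        ("corollary", "corollary"),
        ("conj", "conjecture"),
    ]
    .iter()
    {
        map.insert(*alias, *class);
    }
    map
}

/// Normalize a single raw environment name; anything outside the surveyed
/// aliases falls into the catch-all `other` class.
fn normalize_class(map: &HashMap<&'static str, &'static str>, raw: &str) -> &'static str {
    map.get(raw.trim().to_lowercase().as_str())
        .copied()
        .unwrap_or("other")
}

fn main() {
    let map = normalization_map();
    assert_eq!(normalize_class(&map, "Thm"), "theorem");
    assert_eq!(normalize_class(&map, "my_exotic_env"), "other");
    println!("normalization sketch ok");
}
```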