Here is a preliminary report of running this PR over the latest dataset. The numbers include three low-volume classes I've decided to additionally discard (`convention`, `keywords`, `note`), and, as a first experiment, I am keeping two of the classes that are under the 10k threshold but look interesting to model: `theory` and `data set`. That makes for a total of 46 classes and 22,107,181 distinct paragraphs.
Run details (updated):
1,374,539 total traversed documents;
379,928 AMS marked-up documents;
22,718,502 paragraphs;
859,874 discarded paragraphs (irregular word count or word length);
22,107,217 paragraphs written to the .tar destination (duplicate SHA256-based filenames discarded).
---
AMS paragraph model finished in 18054s.
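As a side note on the two discard steps in the run details above: the "irregular word count or word length" check and the SHA256-filename deduplication boil down to something like the following sketch. The bounds and helper names here are placeholders of mine, not the actual llamapun code, and the hashing assumes the `sha2` crate.

```rust
use sha2::{Digest, Sha256};
use std::collections::HashSet;

/// Discard paragraphs with an irregular word count or word length.
/// The bounds below are illustrative placeholders, not llamapun's real limits.
fn is_regular(paragraph: &str) -> bool {
    let words: Vec<&str> = paragraph.split_whitespace().collect();
    let word_count_ok = words.len() >= 4 && words.len() <= 1_024;
    let word_length_ok = words.iter().all(|w| w.chars().count() <= 30);
    word_count_ok && word_length_ok
}

/// Use the hex-encoded SHA256 of the paragraph text as its filename,
/// so exact duplicates collapse onto the same .tar entry.
fn paragraph_filename(paragraph: &str) -> String {
    let digest = Sha256::digest(paragraph.as_bytes());
    digest.iter().map(|byte| format!("{:02x}", byte)).collect()
}

fn main() {
    let mut seen: HashSet<String> = HashSet::new();
    let paragraphs = ["We prove the main theorem.", "We prove the main theorem."];
    for p in paragraphs {
        if !is_regular(p) {
            continue; // counted among the discarded paragraphs
        }
        let name = format!("{}.txt", paragraph_filename(p));
        if seen.insert(name.clone()) {
            println!("writing {}", name); // would be written to the .tar destination
        } // else: duplicate SHA256-based filename, skipped
    }
}
```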
Counts of included classes (the threshold cutoff was at 10,000 instances, with the two exceptions of `theory` and `data set`):
1,167,923 abstract
680,991 acknowledgement
120,661 analysis
13,212 application
40,409 assumption
34,819 background
7,098,238 caption
15,058 case
94,910 claim
511,117 conclusion
46,124 condition
64,350 conjecture
29,205 contribution
493,600 corollary
10,589 data
9,738 dataset
844,670 definition
24,984 demonstration
25,337 description
192,629 discussion
390,229 example
120,689 experiment
20,846 fact
12,263 future work
10,849 implementation
1,056,110 introduction
1,513,073 lemma
119,913 methods
343,543 model
16,887 motivation
69,567 notation
70,621 observation
68,695 preliminaries
126,985 problem
2,719,458 proof
65,284 property
940,306 proposition
39,777 question
54,910 related work
797,994 remark
299,991 result
59,396 simulation
14,255 step
139,725 summary
1,510,103 theorem
7,184 theory
Decided to discard:
2,469 convention
2,344 keywords
5,024 note
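The 10,000-instance cutoff with the two hand-picked exceptions amounts to a simple filter over these per-class counts. A minimal sketch (only a few classes shown, and `included_classes` is a hypothetical helper of mine, not llamapun API):

```rust
use std::collections::HashMap;

/// Keep classes with at least 10,000 paragraphs, plus explicitly
/// whitelisted low-volume classes that look interesting to model.
fn included_classes<'a>(counts: &HashMap<&'a str, u64>) -> Vec<&'a str> {
    let threshold = 10_000;
    let exceptions = ["theory", "dataset"];
    let mut kept = Vec::new();
    for (&class, &count) in counts {
        if count >= threshold || exceptions.iter().any(|&e| e == class) {
            kept.push(class);
        }
    }
    kept.sort();
    kept
}

fn main() {
    // A few entries from the report above, for illustration.
    let counts = HashMap::from([
        ("theorem", 1_510_103),
        ("theory", 7_184),
        ("dataset", 9_738),
        ("note", 5_024),
        ("convention", 2_469),
    ]);
    println!("{:?}", included_classes(&counts));
    // prints ["dataset", "theorem", "theory"]
}
```

Anything that falls below the threshold and is not whitelisted lands in the "decided to discard" bucket above.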
Edit: there are obviously many more keywords in arXiv; the low count is due to using the (very strange and unusual) `\newtheorem{keywords}` environments from old AMS-based articles. The latexml 0.8.4 markup for keywords seems to be more reliably `<div class="ltx_keywords">`, which llamapun isn't currently looking at. They can be re-obtained if one decides to. I'm a bit torn on whether it's worth the rerun, since keywords were never of explicit interest and are not a real statement per se. Metadata is, as the name implies, outside of the main discourse, so I also did not include references, subject classifications (e.g. MSC, AMS, PAC), appendices, etc., which are high volume in the corpus. The data can be re-mined at any point and takes ~5 hours with the extra headings normalization, so not much is lost.
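If keywords were ever wanted back, the `ltx_keywords` markup could be picked up with a class-based XPath query, e.g. along these lines with the `libxml` crate (a sketch only; the query and the `keyword_paragraphs` function are illustrations of mine, not existing llamapun code):

```rust
use libxml::parser::Parser;
use libxml::xpath::Context;

/// Collect the text of <div class="ltx_keywords"> elements from a
/// latexml-produced HTML document (sketch; not current llamapun behaviour).
fn keyword_paragraphs(html: &str) -> Vec<String> {
    let parser = Parser::default_html();
    let doc = parser.parse_string(html).expect("latexml HTML should parse");
    let context = Context::new(&doc).expect("xpath context");
    let xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' ltx_keywords ')]";
    let found = context.evaluate(xpath).expect("valid xpath");
    found
        .get_nodes_as_vec()
        .iter()
        .map(|node| node.get_content())
        .collect()
}

fn main() {
    let html = r#"<html><body>
        <div class="ltx_keywords">key phrases, arXiv, statement classification</div>
    </body></html>"#;
    println!("{:?}", keyword_paragraphs(html));
}
```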
Rerunning the statement extraction with the latest state and will merge here :+1:
Numbers updated, content looks good. Thanks mostly to captions, we're now looking at a new dataset of 22.1 million statements in 46 classes, for the 2019 edition of arXMLiv.
Following up on the quick headings roundup for the 2019 dataset, I am preparing a changeset that expands the covered headings to all high-frequency items in the corpus, be they structural headings (abstract, caption, section, subsection...) or AMS-annotated ones.
This led to a bit of additional cleanup, and some extra guards.
With this many changes there is an associated code refactor, and I am now running over the full arXMLiv 2019 dataset to validate that the code works as expected and extracts all relevant items.
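To give a flavour of what the headings normalization means here, a toy sketch: raw headings get cleaned up and mapped onto canonical class names, whether they come from structural markup or AMS environments. The cleanup rules and the mapping entries below are illustrative placeholders of mine, not the actual normalization in the changeset.

```rust
/// Toy heading normalization: map raw section headings onto canonical class
/// names. The cleanup rules and mapping table are placeholders.
fn normalize_heading(raw: &str) -> Option<String> {
    // Strip leading numbering ("3.1 Related Work") and trailing punctuation.
    let cleaned = raw
        .trim_start_matches(|c: char| c.is_ascii_digit() || c == '.' || c == ' ')
        .trim_end_matches(|c: char| c == '.' || c == ':' || c == ' ')
        .to_lowercase();
    match cleaned.as_str() {
        "related works" | "prior work" => Some("related work".to_string()),
        "conclusions" | "concluding remarks" => Some("conclusion".to_string()),
        "acknowledgments" | "acknowledgements" => Some("acknowledgement".to_string()),
        "" => None, // nothing left after cleanup
        other => Some(other.to_string()),
    }
}

fn main() {
    assert_eq!(
        normalize_heading("5. Concluding Remarks"),
        Some("conclusion".to_string())
    );
    assert_eq!(
        normalize_heading("Related Works:"),
        Some("related work".to_string())
    );
    println!("ok");
}
```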