Here is a preliminary report of running this PR over the latest dataset. The numbers include three low-volume classes I've decided to additionally discard (`convention`, `keywords`, `note`), and, as a first experiment, I am keeping two of the classes that are under the 10k threshold but look interesting to model: `theory` and `data set`. That makes for a total of 46 classes and 22,107,181 distinct paragraphs.
Run details (updated):
1,374,539 total traversed documents;
379,928 AMS marked-up documents;
22,718,502 paragraphs;
859,874 discarded paragraphs (irregular word count or word length);
22,107,217 paragraphs written to the .tar destination (duplicate SHA256-based filenames discarded).
---
AMS paragraph model finished in 18054s.
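As a side note on the two discard steps in the run details above: the "irregular word count or word length" check and the SHA256-filename deduplication boil down to something like the following sketch. The bounds and helper names here are placeholders of mine, not the actual llamapun code, and the hashing assumes the `sha2` crate.

```rust
use sha2::{Digest, Sha256};
use std::collections::HashSet;

/// Discard paragraphs with an irregular word count or word length.
/// The bounds below are illustrative placeholders, not llamapun's real limits.
fn is_regular(paragraph: &str) -> bool {
    let words: Vec<&str> = paragraph.split_whitespace().collect();
    let word_count_ok = words.len() >= 4 && words.len() <= 1_024;
    let word_length_ok = words.iter().all(|w| w.chars().count() <= 30);
    word_count_ok && word_length_ok
}

/// Use the hex-encoded SHA256 of the paragraph text as its filename,
/// so exact duplicates collapse onto the same .tar entry.
fn paragraph_filename(paragraph: &str) -> String {
    let digest = Sha256::digest(paragraph.as_bytes());
    digest.iter().map(|byte| format!("{:02x}", byte)).collect()
}

fn main() {
    let mut seen: HashSet<String> = HashSet::new();
    let paragraphs = ["We prove the main theorem.", "We prove the main theorem."];
    for p in paragraphs {
        if !is_regular(p) {
            continue; // counted among the discarded paragraphs
        }
        let name = format!("{}.txt", paragraph_filename(p));
        if seen.insert(name.clone()) {
            println!("writing {}", name); // would be written to the .tar destination
        } // else: duplicate SHA256-based filename, skipped
    }
}
```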
Counts of included classes (the threshold cutoff was at 10,000 instances, with the two exceptions of `theory` and `data set`):
1,167,923 abstract
680,991 acknowledgement
120,661 analysis
13,212 application
40,409 assumption
34,819 background
7,098,238 caption
15,058 case
94,910 claim
511,117 conclusion
46,124 condition
64,350 conjecture
29,205 contribution
493,600 corollary
10,589 data
9,738 dataset
844,670 definition
24,984 demonstration
25,337 description
192,629 discussion
390,229 example
120,689 experiment
20,846 fact
12,263 future work
10,849 implementation
1,056,110 introduction
1,513,073 lemma
119,913 methods
343,543 model
16,887 motivation
69,567 notation
70,621 observation
68,695 preliminaries
126,985 problem
2,719,458 proof
65,284 property
940,306 proposition
39,777 question
54,910 related work
797,994 remark
299,991 result
59,396 simulation
14,255 step
139,725 summary
1,510,103 theorem
7,184 theory
Decided to discard:
2,469 convention
2,344 keywords
5,024 note
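The 10,000-instance cutoff with the two hand-picked exceptions amounts to a simple filter over these per-class counts. A minimal sketch (only a few classes shown, and `included_classes` is a hypothetical helper of mine, not llamapun API):

```rust
use std::collections::HashMap;

/// Keep classes with at least 10,000 paragraphs, plus explicitly
/// whitelisted low-volume classes that look interesting to model.
fn included_classes<'a>(counts: &HashMap<&'a str, u64>) -> Vec<&'a str> {
    let threshold = 10_000;
    let exceptions = ["theory", "dataset"];
    let mut kept = Vec::new();
    for (&class, &count) in counts {
        if count >= threshold || exceptions.iter().any(|&e| e == class) {
            kept.push(class);
        }
    }
    kept.sort();
    kept
}

fn main() {
    // A few entries from the report above, for illustration.
    let counts = HashMap::from([
        ("theorem", 1_510_103),
        ("theory", 7_184),
        ("dataset", 9_738),
        ("note", 5_024),
        ("convention", 2_469),
    ]);
    println!("{:?}", included_classes(&counts));
    // prints ["dataset", "theorem", "theory"]
}
```

Anything that falls below the threshold and is not whitelisted lands in the "decided to discard" bucket above.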
Edit: there are obviously many more keywords in arXiv; the low count is due to using the (very strange and unusual) `\newtheorem{keywords}` environments from old AMS-based articles. The latexml 0.8.4 markup for keywords seems to be more reliably `<div class="ltx_keywords">`, which llamapun isn't currently looking at. They can be re-obtained if one decides to. I'm a bit torn on whether it's worth the rerun, since keywords were never of explicit interest and are not a real statement per se. Metadata is, as the name implies, outside of the main discourse, so I also did not include references, subject classifications (e.g. MSC, AMS, PAC), appendices, etc., which are high volume in the corpus. The data can be re-mined at any point and takes ~5 hours with the extra headings normalization, so not much is lost.
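If keywords were ever wanted back, the `ltx_keywords` markup could be picked up with a class-based XPath query, e.g. along these lines with the `libxml` crate (a sketch only; the query and the `keyword_paragraphs` function are illustrations of mine, not existing llamapun code):

```rust
use libxml::parser::Parser;
use libxml::xpath::Context;

/// Collect the text of <div class="ltx_keywords"> elements from a
/// latexml-produced HTML document (sketch; not current llamapun behaviour).
fn keyword_paragraphs(html: &str) -> Vec<String> {
    let parser = Parser::default_html();
    let doc = parser.parse_string(html).expect("latexml HTML should parse");
    let context = Context::new(&doc).expect("xpath context");
    let xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' ltx_keywords ')]";
    let found = context.evaluate(xpath).expect("valid xpath");
    found
        .get_nodes_as_vec()
        .iter()
        .map(|node| node.get_content())
        .collect()
}

fn main() {
    let html = r#"<html><body>
        <div class="ltx_keywords">key phrases, arXiv, statement classification</div>
    </body></html>"#;
    println!("{:?}", keyword_paragraphs(html));
}
```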
Rerunning the statement extraction with the latest state and will merge here :+1:
Numbers updated, content looks good. Thanks mostly to captions, we're now looking at a new dataset of 22.1 million statements in 46 classes, for the 2019 edition of arXMLiv.
Following up on the quick headings roundup for the 2019 dataset, I am preparing a changeset that expands the covered headings to all high-frequency items in the corpus, be they structural headings (abstract, caption, section, subsection...) or AMS-annotated ones.
This led to a bit of additional cleanup, and some extra guards.
With this many changes there is an associated code refactor, and I am now running over the full arXMLiv 2019 dataset to validate that the code works as expected and extracts all relevant items.
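To give a flavour of what the headings normalization means here, a toy sketch: raw headings get cleaned up and mapped onto canonical class names, whether they come from structural markup or AMS environments. The cleanup rules and the mapping entries below are illustrative placeholders of mine, not the actual normalization in the changeset.

```rust
/// Toy heading normalization: map raw section headings onto canonical class
/// names. The cleanup rules and mapping table are placeholders.
fn normalize_heading(raw: &str) -> Option<String> {
    // Strip leading numbering ("3.1 Related Work") and trailing punctuation.
    let cleaned = raw
        .trim_start_matches(|c: char| c.is_ascii_digit() || c == '.' || c == ' ')
        .trim_end_matches(|c: char| c == '.' || c == ':' || c == ' ')
        .to_lowercase();
    match cleaned.as_str() {
        "related works" | "prior work" => Some("related work".to_string()),
        "conclusions" | "concluding remarks" => Some("conclusion".to_string()),
        "acknowledgments" | "acknowledgements" => Some("acknowledgement".to_string()),
        "" => None, // nothing left after cleanup
        other => Some(other.to_string()),
    }
}

fn main() {
    assert_eq!(
        normalize_heading("5. Concluding Remarks"),
        Some("conclusion".to_string())
    );
    assert_eq!(
        normalize_heading("Related Works:"),
        Some("related work".to_string())
    );
    println!("ok");
}
```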