KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Paragraph classification dataset, first public release #32

Closed: dginev closed this issue 5 years ago

dginev commented 5 years ago

Some tasks remain to be done for me to extract and publish a first version of an AMS-statement paragraph classification dataset (arXMLiv 08.2018):

$ cargo run --release --example corpus_ams_para_model
  1. [x] Should contain only the first paragraph of each statement, for the 28 classes shortlisted by my manual survey. Should not contain the "other" class, as it is unreliable (its content is very heterogeneous and may intersect with the main classes). Update: a lot of the "bias-based" groupings I made to arrive at 28 classes look "too certain" upon review. It may be best to let the model itself discriminate between which classes are separable and which are confusable, rather than blur the line myself too far beforehand. So this got relaxed to 50 classes.

  2. Statistical reports

  3. [x] Should try to remove non-English entries to denoise further (see the language-filter sketch after this list).

  4. [x] TODO for v2019: should be rerun with the math lexeme patches merged in latexml #1131.

  5. [x] Should include end-of-sentence markers. I would like that to be a new unique token, to ensure there is no confusion with formula lexemes (e.g. if we used a dot char); so far I've thought of ENDSENT and used that in our doc2vec experiments. This may aid comparisons with sentence-recognizing methods such as HANs. DONE: the line-breaks at sentence ends are a good representation; one can add explicit tokens in the experiment-specific preprocessing (see the token sketch after this list).

  6. [x] Are all paragraphs unique? (The SHA-based filename ensures they are.)

  7. [x] Shuffle the final paragraph resource, so that we lose any ordering relationship to the documents of origin. This way we can distribute the resource as derivative + fair use over arXiv, and avoid the NDA restrictions.

    • Expanding on this: if we use a big SHA #33 over the content as the original filename, one can simply do a follow-up renaming pass over each directory, switching the SHAs into auto-incremented ids (see the naming sketch after this list).
  8. [x] Include the main document "abstract" for all documents (not only the AMS-marked-up ones).
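
For the non-English denoising in item 3, a filter along these lines would do; this is a minimal sketch assuming the `whatlang` crate, not necessarily the detector used in the actual extraction run:

```rust
// English-only filter sketch; `whatlang` is an assumption here, the
// actual denoising step may rely on a different detector.
use whatlang::{detect, Lang};

/// Keep a paragraph only if it is reliably detected as English.
fn is_english(paragraph: &str) -> bool {
    match detect(paragraph) {
        Some(info) => info.lang() == Lang::Eng && info.is_reliable(),
        None => false,
    }
}
```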
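
For item 5, since the extracted paragraphs already carry one sentence per line, the explicit ENDSENT token can be added in a tiny experiment-specific preprocessing pass; a hypothetical sketch:

```rust
/// Append an explicit end-of-sentence token to each sentence line, so
/// a sentence boundary can never be confused with a formula lexeme.
fn mark_sentence_ends(paragraph: &str) -> String {
    paragraph
        .lines()
        .map(|sentence| format!("{} ENDSENT", sentence.trim_end()))
        .collect::<Vec<_>>()
        .join("\n")
}
```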
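
For items 6 and 7, content-addressed filenames give deduplication for free, and the follow-up renaming pass erases document order; a sketch using the `sha2` crate (the helper names are mine, not llamapun's):

```rust
use sha2::{Digest, Sha256};
use std::{fs, io, path::Path};

/// Name a paragraph file after the SHA256 of its content:
/// identical paragraphs collide on the same name, deduplicating for free.
fn content_filename(paragraph: &str) -> String {
    format!("{:x}.txt", Sha256::digest(paragraph.as_bytes()))
}

/// Follow-up pass: swap the SHA names for auto-incremented ids. The
/// directory iteration order is already unrelated to document order;
/// a guaranteed shuffle could additionally randomize the id assignment.
fn anonymize_directory(dir: &Path) -> io::Result<()> {
    for (id, entry) in fs::read_dir(dir)?.enumerate() {
        fs::rename(entry?.path(), dir.join(format!("{}.txt", id + 1)))?;
    }
    Ok(())
}
```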

I consider the llamapun repository to be a perfect home for "dataset extraction" tools / APIs.

dginev commented 5 years ago
  1. Upon revisiting the manual survey categorization, there is definitely something to be said for decoupling more of the classes and letting a model-derived "confusion matrix" inform the similarity decisions I made with my a priori bias. arXiv's data can be rather wild.
dginev commented 5 years ago

We arrive at a (near final?) dataset of 10.5 million paragraphs with "scientific statement" annotations:

Label | Frequency
-- | --
proof | 2,125,750
lemma | 1,320,646
theorem | 1,287,653
abstract | 1,030,774
proposition | 829,068
introduction | 688,530
definition | 686,717
remark | 639,038
corollary | 436,768
example | 295,152
conclusion | 284,585
result | 239,931
acknowledgement | 162,230
discussion | 116,650
claim | 89,737
method | 50,970
conjecture | 44,893
problem | 30,369
assumption | 29,577
question | 27,240
related work | 26,300
demonstration | 23,043
observation | 18,776
fact | 17,737
notation | 16,611
overview | 11,279
step | 6,910
note | 4,462
condition | 3,950
case | 3,256
convention | 2,176
keywords | 1,565
rule | 775
constraint | 753
exercise | 404
comment | 325
principle | 236
criterion | 236
solution | 163
experiment | 154
summary | 117
bound | 47
issue | 41
answer | 40
affirmation | 36
explanation | 16
expectation | 13
hint | 9
expansion | 5
notice | 4
dginev commented 5 years ago

I will make sure I have a working benchmark using Keras generators on top of this dataset, and validate the sanity of the data with a known model that has a decent success rate.

I will attach a confusion matrix to this issue, and begin preparing a dataset release (plus statistics), when that is complete.

For now I follow the ML community's approach of keeping "noteworthy" experiments in separate repositories; for this line of work, you can find the code here.

dginev commented 5 years ago

Regenerating a control dataset with all math lexemes removed (with the SHA256-based file naming) reduces the data from 10.5 to 10.1 million paragraphs, as follows:

Label | Frequency (no math)
-- | --
proof (nomath) | 2,096,643
theorem (nomath) | 1,212,035
lemma (nomath) | 1,162,557
abstract (nomath) | 1,030,689
proposition (nomath) | 763,268
introduction (nomath) | 688,187
definition (nomath) | 667,796
remark (nomath) | 635,180
corollary (nomath) | 402,728
example (nomath) | 289,003
conclusion (nomath) | 284,536
result (nomath) | 239,639
acknowledgement (nomath) | 162,220
discussion (nomath) | 116,643
claim (nomath) | 75,778
method (nomath) | 50,947
conjecture (nomath) | 41,779
problem (nomath) | 29,220
assumption (nomath) | 26,890
question (nomath) | 26,673
relatedwork (nomath) | 26,298
demonstration (nomath) | 22,842
observation (nomath) | 18,013
fact (nomath) | 16,473
notation (nomath) | 16,077
overview (nomath) | 11,277
step (nomath) | 6,536
note (nomath) | 4,415
condition (nomath) | 3,508
case (nomath) | 2,208
convention (nomath) | 2,160
keywords (nomath) | 1,565
constraint (nomath) | 731
rule (nomath) | 712
exercise (nomath) | 404
comment (nomath) | 322
principle (nomath) | 232
criterion (nomath) | 219
experiment (nomath) | 153
solution (nomath) | 144
summary (nomath) | 117
answer (nomath) | 39
bound (nomath) | 37
issue (nomath) | 28
affirmation (nomath) | 22
explanation (nomath) | 16
expectation (nomath) | 13
hint (nomath) | 9
notice (nomath) | 4
expansion (nomath) | 2
dginev commented 5 years ago

Added: all steps in paragraph extraction are now aware of a discard_math flag, so that one can generate a control dataset with all mathematical content skipped over, in order to measure the impact of formula lexemes on model success rate.
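
A minimal sketch of what that flag amounts to, not the actual llamapun code (the real implementation threads `discard_math` through the corpus iterators, and `is_math_lexeme` below is a hypothetical stand-in):

```rust
/// Drop formula lexemes from a token stream when `discard_math` is set,
/// producing the "nomath" control variant of a paragraph.
fn filter_tokens<'a>(tokens: &[&'a str], discard_math: bool) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|token| !discard_math || !is_math_lexeme(token))
        .collect()
}

/// Hypothetical predicate; the real extraction decides this from the
/// <math> markup itself rather than from the token's surface form.
fn is_math_lexeme(token: &str) -> bool {
    token.starts_with("MathFormula")
}
```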