KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Paragraph classification dataset, first public release #32

Closed: dginev closed this issue 5 years ago

dginev commented 5 years ago

Some tasks remain to be done for me to extract and publish a first version of an AMS-statement paragraph classification dataset (arXMLiv 08.2018):

$ cargo run --release --example corpus_ams_para_model
  1. [x] Should contain only the first paragraph of each statement, for the 28 classes shortlisted by my manual survey. Should not contain the "other" class, as it is unreliable (its content is very heterogeneous and may intersect with the main classes). Update: a lot of the "bias-based" groupings I made to arrive at 28 classes look "too certain" upon review. It may be best to let the model itself discriminate between which classes are separable and which are confusable, rather than blur the line myself too far beforehand. So this got relaxed to 50 classes.

  2. Statistical reports

  3. [x] Should try to remove non-English entries to denoise further (see the language-filter sketch after this list).

  4. [x] TODO for v2019: should be rerun with the math lexeme patches merged in latexml #1131.

  5. [x] Should include end-of-sentence markers. I would like that to be a new unique token, to ensure there is no confusion with formula lexemes (e.g. if we used a dot char); so far I've thought of ENDSENT and used that in our doc2vec experiments. This may aid comparisons with sentence-recognizing methods such as HANs. DONE: the line-breaks at sentence ends are a good representation; one can add explicit tokens in the experiment-specific preprocessing (see the token sketch after this list).

  6. [x] Are all paragraphs unique? (The SHA-based filename ensures they are.)

  7. [x] Shuffle the final paragraph resource, so that we lose any ordering relationship to the documents of origin. This way we can distribute the resource as derivative + fair use over arXiv, and avoid the NDA restrictions.

    • Expanding on this: if we use a big SHA #33 over the content as the original filename, one can simply do a follow-up renaming pass over each directory, switching the SHAs into auto-incremented ids (see the naming sketch after this list).
  8. [x] Include the main document "abstract" for all documents (not only the AMS-marked-up ones).
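
For the non-English denoising in item 3, a filter along these lines would do; this is a minimal sketch assuming the `whatlang` crate, not necessarily the detector used in the actual extraction run:

```rust
// English-only filter sketch; `whatlang` is an assumption here, the
// actual denoising step may rely on a different detector.
use whatlang::{detect, Lang};

/// Keep a paragraph only if it is reliably detected as English.
fn is_english(paragraph: &str) -> bool {
    match detect(paragraph) {
        Some(info) => info.lang() == Lang::Eng && info.is_reliable(),
        None => false,
    }
}
```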
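
For item 5, since the extracted paragraphs already carry one sentence per line, the explicit ENDSENT token can be added in a tiny experiment-specific preprocessing pass; a hypothetical sketch:

```rust
/// Append an explicit end-of-sentence token to each sentence line, so
/// a sentence boundary can never be confused with a formula lexeme.
fn mark_sentence_ends(paragraph: &str) -> String {
    paragraph
        .lines()
        .map(|sentence| format!("{} ENDSENT", sentence.trim_end()))
        .collect::<Vec<_>>()
        .join("\n")
}
```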
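
For items 6 and 7, content-addressed filenames give deduplication for free, and the follow-up renaming pass erases document order; a sketch using the `sha2` crate (the helper names are mine, not llamapun's):

```rust
use sha2::{Digest, Sha256};
use std::{fs, io, path::Path};

/// Name a paragraph file after the SHA256 of its content:
/// identical paragraphs collide on the same name, deduplicating for free.
fn content_filename(paragraph: &str) -> String {
    format!("{:x}.txt", Sha256::digest(paragraph.as_bytes()))
}

/// Follow-up pass: swap the SHA names for auto-incremented ids. The
/// directory iteration order is already unrelated to document order;
/// a guaranteed shuffle could additionally randomize the id assignment.
fn anonymize_directory(dir: &Path) -> io::Result<()> {
    for (id, entry) in fs::read_dir(dir)?.enumerate() {
        fs::rename(entry?.path(), dir.join(format!("{}.txt", id + 1)))?;
    }
    Ok(())
}
```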

I consider the llamapun repository to be a perfect home for "dataset extraction" tools / APIs.

dginev commented 5 years ago
  1. Upon revisiting the manual survey categorization, there is definitely something to be said for decoupling more of the classes and letting a model-derived "confusion matrix" inform the similarity decisions I made with my a priori bias. arXiv's data can be rather wild.
dginev commented 5 years ago

We arrive at a (near final?) dataset of 10.5 million paragraphs with "scientific statement" annotations:

Label | Frequency
-- | --
proof | 2,125,750
lemma | 1,320,646
theorem | 1,287,653
abstract | 1,030,774
proposition | 829,068
introduction | 688,530
definition | 686,717
remark | 639,038
corollary | 436,768
example | 295,152
conclusion | 284,585
result | 239,931
acknowledgement | 162,230
discussion | 116,650
claim | 89,737
method | 50,970
conjecture | 44,893
problem | 30,369
assumption | 29,577
question | 27,240
related work | 26,300
demonstration | 23,043
observation | 18,776
fact | 17,737
notation | 16,611
overview | 11,279
step | 6,910
note | 4,462
condition | 3,950
case | 3,256
convention | 2,176
keywords | 1,565
rule | 775
constraint | 753
exercise | 404
comment | 325
principle | 236
criterion | 236
solution | 163
experiment | 154
summary | 117
bound | 47
issue | 41
answer | 40
affirmation | 36
explanation | 16
expectation | 13
hint | 9
expansion | 5
notice | 4
dginev commented 5 years ago

I will make sure I have a working benchmark using Keras generators on top of this dataset, and validate the sanity of the data with a known model that has a decent success rate.

I will attach a confusion matrix to this issue, and begin preparing a dataset release (plus statistics), when that is complete.

For now I follow the ML community's approach of keeping "noteworthy" experiments in separate repositories; for this line of work, you can find the code here.

dginev commented 5 years ago

Regenerating a control dataset with all math lexemes removed (with the SHA256-based file naming) reduces the data from 10.5 to 10.1 million paragraphs, as follows:

Label | Frequency (no math)
-- | --
proof (nomath) | 2,096,643
theorem (nomath) | 1,212,035
lemma (nomath) | 1,162,557
abstract (nomath) | 1,030,689
proposition (nomath) | 763,268
introduction (nomath) | 688,187
definition (nomath) | 667,796
remark (nomath) | 635,180
corollary (nomath) | 402,728
example (nomath) | 289,003
conclusion (nomath) | 284,536
result (nomath) | 239,639
acknowledgement (nomath) | 162,220
discussion (nomath) | 116,643
claim (nomath) | 75,778
method (nomath) | 50,947
conjecture (nomath) | 41,779
problem (nomath) | 29,220
assumption (nomath) | 26,890
question (nomath) | 26,673
relatedwork (nomath) | 26,298
demonstration (nomath) | 22,842
observation (nomath) | 18,013
fact (nomath) | 16,473
notation (nomath) | 16,077
overview (nomath) | 11,277
step (nomath) | 6,536
note (nomath) | 4,415
condition (nomath) | 3,508
case (nomath) | 2,208
convention (nomath) | 2,160
keywords (nomath) | 1,565
constraint (nomath) | 731
rule (nomath) | 712
exercise (nomath) | 404
comment (nomath) | 322
principle (nomath) | 232
criterion (nomath) | 219
experiment (nomath) | 153
solution (nomath) | 144
summary (nomath) | 117
answer (nomath) | 39
bound (nomath) | 37
issue (nomath) | 28
affirmation (nomath) | 22
explanation (nomath) | 16
expectation (nomath) | 13
hint (nomath) | 9
notice (nomath) | 4
expansion (nomath) | 2
dginev commented 5 years ago

Added: all steps in paragraph extraction are now aware of a discard_math flag, so that one can generate a control dataset with all mathematical content skipped over, in order to measure the impact of formula lexemes on model success rate.
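
A minimal sketch of what that flag amounts to, not the actual llamapun code (the real implementation threads `discard_math` through the corpus iterators, and `is_math_lexeme` below is a hypothetical stand-in):

```rust
/// Drop formula lexemes from a token stream when `discard_math` is set,
/// producing the "nomath" control variant of a paragraph.
fn filter_tokens<'a>(tokens: &[&'a str], discard_math: bool) -> Vec<&'a str> {
    tokens
        .iter()
        .copied()
        .filter(|token| !discard_math || !is_math_lexeme(token))
        .collect()
}

/// Hypothetical predicate; the real extraction decides this from the
/// <math> markup itself rather than from the token's surface form.
fn is_math_lexeme(token: &str) -> bool {
    token.starts_with("MathFormula")
}
```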