We arrive at a (near final?) dataset of 10.5 million paragraphs with "scientific statement" annotations:
I will make sure I have a working benchmark using Keras generators on top of this dataset, and validate the sanity of the data (with a known model that has a decent success rate).
Will attach a confusion matrix to this issue, and begin preparing a dataset release (+ statistics), when that is complete.
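For reference, here is a minimal sketch of what such a Keras generator could look like, assuming one plain-text file per paragraph in a class-per-subdirectory layout and a `word_index` token-to-id mapping built elsewhere; the class and parameter names are hypothetical illustrations, not the actual benchmark code (which lives in the experiments repository mentioned just below).

```python
import os
import numpy as np
from tensorflow import keras

class ParagraphSequence(keras.utils.Sequence):
    """Batches paragraph files from a hypothetical
    dataset_dir/<class_name>/<sha256>.txt layout, where <class_name>
    is one of the scientific statement classes."""

    def __init__(self, dataset_dir, word_index, class_names,
                 batch_size=128, max_len=480):
        # Collect (file path, class id) pairs for every paragraph file.
        self.samples = [
            (os.path.join(dataset_dir, name, filename), label)
            for label, name in enumerate(class_names)
            for filename in os.listdir(os.path.join(dataset_dir, name))
        ]
        self.word_index = word_index
        self.batch_size = batch_size
        self.max_len = max_len
        self.num_classes = len(class_names)

    def __len__(self):
        return int(np.ceil(len(self.samples) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.samples[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.zeros((len(batch), self.max_len), dtype="int32")
        y = np.zeros((len(batch), self.num_classes), dtype="float32")
        for row, (path, label) in enumerate(batch):
            with open(path, encoding="utf-8") as fh:
                tokens = fh.read().split()[:self.max_len]
            # Unknown tokens fall back to index 0 (shared with padding here,
            # purely for brevity of the sketch).
            x[row, :len(tokens)] = [self.word_index.get(t, 0) for t in tokens]
            y[row, label] = 1.0
        return x, y
```

Such a `Sequence` can be handed directly to `model.fit` (or `fit_generator` in older Keras versions), which keeps memory usage flat even at 10.5 million paragraphs.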
For now I follow the ML community's practice of keeping "noteworthy" experiments in separate repositories; the code for this line of work can be found here.
Regenerating a control dataset with all math lexemes removed (with the SHA256-based file naming) reduces the data from 10.5 to 10.1 million paragraphs, as follows:
Added: all steps in paragraph extraction are now aware of a `discard_math` flag, so that one can generate a control dataset with all mathematical content skipped over, in order to measure the impact of formula lexemes on model success rate.
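To make the mechanics concrete, here is a small Python sketch of the SHA256-based naming combined with an optional math-discarding filter; the `save_paragraph` function and the `is_math_lexeme` predicate are hypothetical stand-ins, the real handling lives in the llamapun extraction code.

```python
import hashlib
import os

def save_paragraph(text, out_dir, discard_math=False, is_math_lexeme=None):
    """Write one paragraph to <out_dir>/<sha256 of content>.txt.

    Naming the file after the SHA256 of its (post-filtering) content means
    identical paragraphs overwrite each other, so the resulting directory
    is deduplicated by construction."""
    if discard_math and is_math_lexeme is not None:
        # Drop every token the (hypothetical) predicate flags as a math lexeme.
        text = " ".join(t for t in text.split() if not is_math_lexeme(t))
    if not text.strip():
        return None  # paragraph became empty once math was removed
    os.makedirs(out_dir, exist_ok=True)
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = os.path.join(out_dir, digest + ".txt")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return path
```

Paragraphs consisting entirely of mathematical content vanish under such a filter, which is one plausible reason the control dataset shrinks from 10.5 to 10.1 million paragraphs.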
Some tasks that remain to be done for me to extract and publish a first version of a dataset release:

1. Remove the `other` class, as it is unreliable (its content is very heterogeneous, and may intersect with the main classes). Update: a lot of the "bias-based" groupings I made to arrive at 28 classes look "too certain" upon review. It may be best to allow the model itself to discriminate between which classes are separable and which are confusable, rather than blur the line myself too far beforehand. So this got relaxed to 50 classes.
2. Statistical reports
[x] Should try to remove non-English entries (e.g. via `whatlang`) to denoise further.
[x] TODO for v2019: should be rerun with the math lexeme patches merged in latexml #1131.
[x] Should include end-of-sentence markers. I would like that to be a new unique token, just to ensure there is no confusion with formula lexemes (e.g. if we used a dot character); so far I've thought of `ENDSENT`, and used that in our doc2vec experiments. This may aid comparisons with sentence-recognizing methods such as HANs. DONE: the line breaks at sentence ends are a good representation; one can add explicit tokens in the experiment-specific preprocessing (see the sketch after this list).
[x] Are all paragraphs unique? (The SHA256 file names ensure they are.)
[x] Shuffle the final paragraphs resource, so that we lose any ordering relationship to the documents of origin. This way we can distribute the resource as derivative + fair use over arXiv, and avoid the NDA restrictions.
[x] Include the main document "abstract" for all documents (not only the AMS-marked-up ones).
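As mentioned in the end-of-sentence item above, explicit markers can be added in experiment-specific preprocessing. A minimal sketch, assuming the one-sentence-per-line representation of the extracted paragraphs; the `ENDSENT` token name is just the candidate floated above, not a fixed part of the dataset:

```python
def with_sentence_markers(paragraph_text, marker="ENDSENT"):
    """Flatten a one-sentence-per-line paragraph into a single token stream,
    appending an explicit end-of-sentence marker after every sentence.

    The marker is a token of its own so it cannot be confused with formula
    lexemes or ordinary words (a bare '.' would be ambiguous)."""
    tokens = []
    for line in paragraph_text.splitlines():
        sentence = line.split()
        if sentence:
            tokens.extend(sentence)
            tokens.append(marker)
    return tokens
```

This keeps the distributed files unchanged (line breaks at sentence ends) while still supporting sentence-aware models such as HANs.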
I consider the llamapun repository to be a perfect home for "dataset extraction" tools / APIs.