Closed by dginev 5 years ago
To toot llamapun's performance horn a little here: the regeneration took ~200 minutes to traverse the entirety of 1.2 million arXiv documents and extract 10.5 million plain-text, normalized paragraph entries into a .tar file.
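For a rough sense of scale, the quoted figures can be turned into throughput numbers. This is only a back-of-the-envelope sketch: it assumes the document count is 1.2 million, treats the ~200 minutes as exact, and the helper name `per_second` is made up for illustration.

```python
# Figures quoted in the comment above.
# Assumption: the document count is 1.2 million; ~200 minutes is treated as exact.
DOCUMENTS = 1_200_000
PARAGRAPHS = 10_500_000
MINUTES = 200


def per_second(count: int, minutes: float) -> float:
    """Average number of items processed per second over the whole run."""
    return count / (minutes * 60)


docs_per_sec = per_second(DOCUMENTS, MINUTES)
pars_per_sec = per_second(PARAGRAPHS, MINUTES)
pars_per_doc = PARAGRAPHS / DOCUMENTS

print(
    f"{docs_per_sec:.0f} documents/s, "
    f"{pars_per_sec:.0f} paragraphs/s, "
    f"{pars_per_doc:.2f} paragraphs/document"
)
```

Under those assumptions this works out to roughly 100 documents and 875 extracted paragraphs per second, or about 8.75 retained paragraphs per document on average.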
Feeling quite confident in the paragraph dataset(s) at this point, will merge here and mark a minor llamapun release from master.
Fixes #32.
There are major improvements to controlling quality and de-noising the paragraphs extracted for the "AMS/mathematical statement" classification task. This PR has already produced a dataset of 10.5 million paragraphs. Downstream benchmarking and sanity checks still need to be finished before merging here; more details in the issue.