Paragraph dataset, first official release

KWARC / llamapun

common language and mathematics processing algorithms, in Rust

https://kwarc.info/systems/llamapun/

GNU General Public License v3.0

Paragraph dataset, first official release #34

Closed dginev closed 5 years ago

dginev commented 5 years ago

Fixes #32 .

This PR brings major improvements to quality control and de-noising of the paragraphs extracted for the "AMS/mathematical statement" classification task. It has already produced a dataset of 10.5 million paragraphs. Remaining before merging here: finish the downstream benchmarking and sanity checks. More details are in the issue.

dginev commented 5 years ago

To toot llamapun's performance horn a little here: the regeneration took ~200 minutes to traverse the entirety of 1.2 million arXiv documents and extract 10.5 million plain-text, normalized paragraph entries into a .tar file.
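
For a rough sense of scale, the figures above can be turned into throughput numbers. This is only a back-of-the-envelope sketch, not part of llamapun: the helper `per_second` is hypothetical, the document count is assumed to be 1.2 million, and the "~200 minutes" is treated as exactly 200.

```rust
/// Items processed per second, given a total count and a wall-clock
/// duration in minutes. Hypothetical helper, for illustration only.
fn per_second(total_items: f64, minutes: f64) -> f64 {
    total_items / (minutes * 60.0)
}

fn main() {
    // Figures quoted in the comment above.
    // Assumption: the document count is 1.2 million.
    let documents = 1.2e6;
    let paragraphs = 10.5e6;
    // Assumption: "~200 minutes" is taken as exactly 200.
    let minutes = 200.0;

    println!("documents/s:  {:.0}", per_second(documents, minutes));
    println!("paragraphs/s: {:.0}", per_second(paragraphs, minutes));
    // Average number of extracted paragraphs per traversed document.
    println!("paragraphs/document: {:.2}", paragraphs / documents);
}
```

Under those assumptions this works out to roughly 100 documents and 875 extracted paragraphs per second. The per-document average only counts paragraphs selected for the classification task, not every paragraph in a document.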

dginev commented 5 years ago

I am feeling quite confident in the paragraph dataset(s) at this point, so I will merge here and mark a minor llamapun release from master.