KWARC / llamapun

common language and mathematics processing algorithms, in Rust
https://kwarc.info/systems/llamapun/
GNU General Public License v3.0

Upgrade to SHA-256 for hashing in dataset generation #33

Closed (dginev closed this issue 5 years ago)

dginev commented 5 years ago

A quick de-noising trick for dataset extraction (be it documents or paragraphs) is to hash each item and maintain a set of unique shas, only including items with unseen shas in the extracted data.

This will allow us to include only a single copy of various artifacts, such as documents that are simple "this document has been retracted" messages, and also to remove potential trivial duplicates in the paragraph data (e.g. "the proof is left as an exercise to the reader").
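
A minimal sketch of the idea, not the code that landed in the repository. The names `dedup_by_digest` and `std_fingerprint` are hypothetical, and the standard library's `DefaultHasher` stands in for a real digest so the example runs without dependencies; it is a 64-bit, non-cryptographic hash and not SHA-256. With the `sha2` crate, one could presumably pass `|b| Sha256::digest(b)` as the digest function instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Keeps only the first occurrence of each item, judged by its digest.
/// `digest` is any function from bytes to a hashable fingerprint.
fn dedup_by_digest<'a, D, F>(items: &[&'a str], digest: F) -> Vec<&'a str>
where
    D: Hash + Eq,
    F: Fn(&[u8]) -> D,
{
    let mut seen: HashSet<D> = HashSet::new();
    let mut unique = Vec::new();
    for &item in items {
        // `insert` returns true only if this digest was not seen before.
        if seen.insert(digest(item.as_bytes())) {
            unique.push(item);
        }
    }
    unique
}

/// Stand-in fingerprint from the standard library (assumption: NOT SHA-256).
fn std_fingerprint(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let paragraphs = [
        "The proof is left as an exercise to the reader.",
        "We now prove Lemma 2.",
        "The proof is left as an exercise to the reader.",
        "This document has been retracted.",
    ];
    let unique = dedup_by_digest(&paragraphs, std_fingerprint);
    for paragraph in &unique {
        println!("{}", paragraph);
    }
}
```

Only the digests are kept in memory, so the set stays small even when the paragraphs themselves are long. In practice the text would likely be normalized (e.g. canonicalized) before hashing, so that trivially different serializations of the same content map to the same digest.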

The c14n submodule already has an MD5-based hash that supports this kind of filtering, but it would be best to upgrade to SHA-256 and worry even less about possible collisions.
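
For a rough sense of scale, the birthday bound approximates the chance of at least one accidental collision among $n$ random $b$-bit digests. The corpus size $n = 10^9$ below is an illustrative assumption, not a figure from this project:

```latex
P_{\text{collision}} \approx \frac{n^2}{2^{b+1}}, \qquad
\text{MD5 } (b = 128):\ \frac{10^{18}}{2^{129}} \approx 1.5 \times 10^{-21}, \qquad
\text{SHA-256 } (b = 256):\ \frac{10^{18}}{2^{257}} \approx 4.3 \times 10^{-60}
```

Both numbers are tiny for accidental collisions. The estimate does not cover deliberately constructed inputs, for which practical MD5 collisions are known, and that is a separate reason to prefer SHA-256.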

dginev commented 5 years ago

Done in #32