bigscience-workshop / catalogue_data

Scripts to prepare catalogue data
Apache License 2.0

remove whitespace, numbers and punctuation before hashing #33

Closed: lvwerra closed 2 years ago

lvwerra commented 2 years ago

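This makes document deduplication insensitive to whitespace, numbers, and punctuation: these characters are removed from the text before it is hashed, so documents that differ only in them get the same hash.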
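For illustration only, the idea can be sketched roughly as follows. This is not the code from this PR: the function names `normalize_for_hash` and `document_hash`, the regular expression, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
import re

# Assumed pattern (not taken from the PR): matches runs of non-word
# characters (whitespace, punctuation, symbols), decimal digits, and
# underscores, so that mostly letters remain.
_STRIP_PATTERN = re.compile(r"[\W\d_]+")


def normalize_for_hash(text: str) -> str:
    """Drop whitespace, digits, and punctuation; case is left unchanged."""
    return _STRIP_PATTERN.sub("", text)


def document_hash(text: str) -> str:
    """Hash the normalized text (SHA-256 is an assumption for this sketch)."""
    normalized = normalize_for_hash(text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first document seen for each normalized hash."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = document_hash(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

With a normalization like this, `"Hello, world! 123"` and `"Hello world"` hash identically and count as duplicates, while texts that differ in letters or letter case still hash differently.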