bigscience-workshop / catalogue_data

Scripts to prepare catalogue data
Apache License 2.0

Code doesn't need to run deduplication script #46

Closed thomasw21 closed 2 years ago

thomasw21 commented 2 years ago

Code doesn't need the deduplication script run on it: document-level deduplication was already done, and line-level deduplication is undesired. Can you confirm, @lvwerra @TevenLeScao? We could also run document-level deduplication again just in case. Let me know.

TevenLeScao commented 2 years ago

Confirmed!

lvwerra commented 2 years ago

Unconfirmed: document-level deduplication was run with `hash`, which we discovered does not work well with multiprocessing. So I would run it again.

thomasw21 commented 2 years ago

Ah yes @lvwerra, nice catch! FYI, this was the fix: https://github.com/bigscience-workshop/catalogue_data/pull/25
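
For readers wondering why `hash` and multiprocessing can clash: the thread does not say which function or failure mode was involved, so the following is only a sketch under an assumption. If `hash` means Python's built-in `hash()`, its value for strings is salted per interpreter process unless `PYTHONHASHSEED` is set, so worker processes started independently may disagree on the hash of the same document. A content digest from `hashlib` depends only on the bytes. The names `stable_doc_hash` and `dedup_documents` are hypothetical and are not taken from the repository or from the linked PR.

```python
import hashlib
from multiprocessing import Pool


def stable_doc_hash(text: str) -> str:
    """Return a digest that depends only on the text, not on the process."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedup_documents(docs, num_proc=1):
    """Keep the first occurrence of each exact-duplicate document.

    Hypothetical helper: digests are computed in worker processes when
    num_proc > 1, then compared in the parent process.
    """
    if num_proc > 1:
        with Pool(num_proc) as pool:
            digests = pool.map(stable_doc_hash, docs)
    else:
        digests = [stable_doc_hash(doc) for doc in docs]

    seen = set()
    unique = []
    for doc, digest in zip(docs, digests):
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


if __name__ == "__main__":
    print(dedup_documents(["a = 1\n", "b = 2\n", "a = 1\n"], num_proc=2))
```

Because the digest is a pure function of the document text, it does not matter which worker computed it or how that worker was started.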