Closed thomasw21 closed 2 years ago
Confirmed!
Unconfirmed: document-level deduplication was run with `hash`, which we discovered does not work well with multiprocessing. So I would run it again.
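For context, one likely reason, assuming `hash` here means Python's built-in `hash()`: string hashes are salted per interpreter process unless `PYTHONHASHSEED` is fixed, so separately started worker processes can assign different values to the same document. Below is a minimal sketch of a process-independent alternative using `hashlib`. The names `stable_hash` and `dedup_documents` are hypothetical and only for illustration; this is not the code from the actual fix.

```python
import hashlib


def stable_hash(text: str) -> str:
    """Return a SHA-256 hex digest; identical in every process and on every run."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def dedup_documents(docs):
    """Keep the first occurrence of each document, comparing by stable hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = stable_hash(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Because the digest depends only on the document bytes, hashes computed in different workers can be merged and compared safely.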
Ah yes @lvwerra nice catch! FYI, this was the fix https://github.com/bigscience-workshop/catalogue_data/pull/25
The code doesn't need to run the deduplication script, as document-level deduplication was already done and line-level deduplication is undesired. Can you confirm, @lvwerra @TevenLeScao? We could also run document-level deduplication just in case. LMK.