We need to be able to get both the held-out data and the non-held-out data when making splits with `scripts/hash_sample.py`.
I added a `--complement` flag to this script that writes out the documents whose hashes don't match the `calculate_md5_suffix` suffixes. I also updated the logging statement to reflect which mode is being used.
I did a very rudimentary test by splitting a RedPajama file (`pretraining-data/sources/redpajama/v1/documents/split=train/dataset=c4/c4-train.00000-of-01024_00000.jsonl.gz`) this way at 5% and making sure that its hash sample and its complement added up to the full number of documents (17932 + 338023 = 355955).
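
For reviewers, here is a minimal sketch of the idea behind the flag and the sanity check above. It is not the code in `scripts/hash_sample.py`: the function names, the document IDs, and the single-character suffix set are all hypothetical, and it does not reproduce how `calculate_md5_suffix` chooses suffixes for a given percentage. It only illustrates that a document falls in the sample when its MD5 hex digest ends with one of the chosen suffixes, and in the complement otherwise, so the two outputs partition the input.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Return the MD5 hex digest of a string (hypothetical helper)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def split_by_hash_suffix(doc_ids, suffixes):
    """Split IDs into (sample, complement) by MD5 suffix match.

    `suffixes` is an illustrative stand-in for whatever suffixes the
    real script derives; it is not the script's actual selection logic.
    """
    suffixes = tuple(suffixes)  # str.endswith accepts a tuple of candidates
    sample, complement = [], []
    for doc_id in doc_ids:
        if md5_hex(doc_id).endswith(suffixes):
            sample.append(doc_id)      # what the default mode would keep
        else:
            complement.append(doc_id)  # what --complement would keep
    return sample, complement


# Made-up IDs and an arbitrary suffix set, purely for illustration.
doc_ids = [f"doc-{i}" for i in range(1000)]
sample, complement = split_by_hash_suffix(doc_ids, suffixes=("0",))

# Same check as the manual test: the two halves cover every document once.
assert len(sample) + len(complement) == len(doc_ids)
assert set(sample).isdisjoint(complement)
```

Because membership depends only on the hash of each document, running the script once without the flag and once with `--complement` should give two disjoint outputs that together contain every input document.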