feature to get the complement of a hash sample

We need to be able to get both the held out data and the not held out data when making splits with scripts/hash_sample.py.

I added a --complement flag to this script that writes out the documents whose hashes don't match the calculate_md5_suffix suffixes. I also updated the logging statement to reflect when it is doing this.
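For reviewers, here is a rough sketch of the selection logic the flag is meant to toggle. It is not the code in scripts/hash_sample.py: the helper names (`matches_suffix`, `hash_sample`), the choice to hash the raw document text, and the example suffix set are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator, Set


def matches_suffix(text: str, suffixes: Set[str]) -> bool:
    """Return True if the MD5 hex digest of `text` ends with any suffix.

    Hypothetical helper: the real script may hash a different field.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return any(digest.endswith(suffix) for suffix in suffixes)


def hash_sample(
    docs: Iterable[str], suffixes: Set[str], complement: bool = False
) -> Iterator[str]:
    """Yield the docs whose hash matches a suffix, or the rest if complement=True."""
    for doc in docs:
        # Keep matching docs by default; keep non-matching docs for the complement.
        if matches_suffix(doc, suffixes) != complement:
            yield doc


if __name__ == "__main__":
    docs = [f"document number {i}" for i in range(1000)]
    suffixes = {"0"}  # assumed example: one hex digit, roughly 1/16 of documents
    held_out = list(hash_sample(docs, suffixes))
    rest = list(hash_sample(docs, suffixes, complement=True))
    print(len(held_out), len(rest), len(held_out) + len(rest) == len(docs))
```

Because every document either matches a suffix or does not, the sample and its complement are disjoint and together cover the input.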

I did a very rudimentary test by splitting a RedPajama file (pretraining-data/sources/redpajama/v1/documents/split=train/dataset=c4/c4-train.00000-of-01024_00000.jsonl.gz) this way at 5% and checking that the hash sample and its complement add up to the full number of documents (17932 + 338023 = 355955).
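A check along these lines could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes the input and both outputs are gzipped JSONL files with one document per line. It compares counts only, so it would not catch a document that lands in both outputs while another is dropped.

```python
import gzip


def count_documents(path: str) -> int:
    """Count non-empty lines (assumed: one JSON document per line) in a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


def counts_add_up(full_path: str, sample_path: str, complement_path: str) -> bool:
    """Return True if sample and complement together have as many documents as the input."""
    total = count_documents(full_path)
    parts = count_documents(sample_path) + count_documents(complement_path)
    return parts == total
```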

allenai / dolma

feature to get the complement of a hash sample #72