The current code groups and saves text very slowly when the dataset is large.
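For context, here is a minimal sketch of the kind of grouping step that benefits from multiprocessing. It is not the code from this PR: `group_texts`, `tokenized_dataset`, and `"out_dir"` are illustrative names, and the commented-out calls assume the Hugging Face `datasets` API (`Dataset.map` and `Dataset.save_to_disk`, both of which accept `num_proc` in recent versions).

```python
from itertools import chain


def group_texts(examples, block_size=512):
    """Concatenate tokenized columns and split them into block_size chunks."""
    # Flatten each column (e.g. "input_ids") into one long list.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values()))) if concatenated else 0
    # Drop the trailing remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }


# Assumed usage with Hugging Face `datasets` (hypothetical names, not run here):
# lm_dataset = tokenized_dataset.map(group_texts, batched=True, num_proc=8)
# lm_dataset.save_to_disk("out_dir", num_proc=8)
```

Because `group_texts` only depends on the batch it receives, each worker process can handle its own batches independently, which is what makes `num_proc` effective for this step.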

I ran the test against the wiki dataset to confirm that the change works.

Test

python3.8 get_dataset.py
Running tokenizer on dataset: 0%| | 0/1359146
Running tokenizer on dataset: 100%|████████████████████| 1359146/1359146 [07:39<00:00, 2959.40 examples/s]
block_size > tokenizer.model_max_length
Grouping texts in chunks of 512 (num_proc=8): 100%|████████████████████| 1641875
Saving the dataset (22/22 shards): 100%|████████████████████|

aws-neuron / aws-neuron-samples

use multi processing for large datasets #51
