EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics
Apache License 2.0

Reshape error in batch viewer #158

Closed activatedgeek closed 2 months ago

activatedgeek commented 2 months ago

Thank you for the great project!

I have successfully merged all the shards from EleutherAI/pythia_deduped_pile_idxmaps.

However, when I try to read batches out of utils/batch_viewer.py, I get the following error:

    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
/datasets/mmap_dataset.py:226: RuntimeWarning: overflow encountered in scalar add
  offsets = list(accumulate(sizes))
Traceback (most recent call last):
  File "/datasets/batch_viewer.py", line 42, in <module>
    indicies = dataset[args.start_iteration*1024: args.end_iteration*1024 + 1]
               ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/datasets/mmap_dataset.py", line 231, in __getitem__
    return np_array.reshape(-1, 2049)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot reshape array of size 207170414058 into shape (2049)

Each document here seems to be of uneven length, which explains why this reshape would fail.

Would you be able to help me (or just point me to a code reference) so that I can chunk the documents into 2049-token chunks? For context, I only want to run evaluations on a subset of the training data. I want the chunks to be constructed exactly the same way as during training, so that I can put them in a dataloader and simply subsample on top (perhaps with something like torch.utils.data.Subset).
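For what it's worth, a naive way to cut a flat token stream into fixed-length rows is just a truncating reshape; the `chunk_tokens` helper below is a hypothetical sketch of that idea, and it will *not* reproduce the training order or shuffle, only the 2049-token shape:

```python
import numpy as np

def chunk_tokens(tokens, seq_len=2049):
    """Split a flat 1-D token array into fixed-length rows,
    dropping any trailing remainder that does not fill a full chunk."""
    n_full = len(tokens) // seq_len
    return tokens[: n_full * seq_len].reshape(n_full, seq_len)

# toy example: 10 tokens with seq_len=4 -> two rows, last 2 tokens dropped
chunks = chunk_tokens(np.arange(10), seq_len=4)
print(chunks.shape)  # (2, 4)
```

Getting the chunks to match training exactly is the hard part, since it depends on the document order and boundary handling used when the batches were built.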

activatedgeek commented 2 months ago

It looks like the right dataset to use here is EleutherAI/pile-deduped-pythia-preshuffled, which yields a uniform length of 2049 tokens across all samples.
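With uniform sample lengths, the merged token file can be memory-mapped and viewed as one sample per row. The sketch below assumes a flat binary file of uint16 token IDs whose length is a multiple of 2049 (the file path and the `load_preshuffled` helper are illustrative, not from the repo); the demo uses a tiny synthetic file:

```python
import os
import tempfile

import numpy as np

def load_preshuffled(path, seq_len=2049, dtype=np.uint16):
    """Memory-map a flat token file whose length is a whole number of
    seq_len-sized samples, and view it as one sample per row."""
    data = np.memmap(path, dtype=dtype, mode="r")
    assert len(data) % seq_len == 0, "file is not a whole number of samples"
    return data.reshape(-1, seq_len)

# demo on a tiny synthetic file: 4 samples of seq_len 8
tmp = tempfile.NamedTemporaryFile(suffix=".bin", delete=False)
np.arange(4 * 8, dtype=np.uint16).tofile(tmp.name)
samples = load_preshuffled(tmp.name, seq_len=8)
print(samples.shape)  # (4, 8)
os.unlink(tmp.name)
```

Each row can then be wrapped in a torch Dataset and subsampled with torch.utils.data.Subset, as suggested above.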