EleutherAI / pythia

The hub for EleutherAI's work on interpretability and learning dynamics
Apache License 2.0

Reshape error in batch viewer #158

Closed activatedgeek closed 2 months ago

activatedgeek commented 2 months ago

Thank you for the great project!

I have successfully merged all the shards from EleutherAI/pythia_deduped_pile_idxmaps.

However, when I try to read batches out of utils/batch_viewer.py, I get the following error:

    reading sizes...
    reading pointers...
    reading document index...
    creating numpy buffer of mmap...
    creating memory view of numpy buffer...
/datasets/mmap_dataset.py:226: RuntimeWarning: overflow encountered in scalar add
  offsets = list(accumulate(sizes))
Traceback (most recent call last):
  File "/datasets/batch_viewer.py", line 42, in <module>
    indicies = dataset[args.start_iteration*1024: args.end_iteration*1024 + 1]
               ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/datasets/mmap_dataset.py", line 231, in __getitem__
    return np_array.reshape(-1, 2049)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: cannot reshape array of size 207170414058 into shape (2049)

Each document here seems to be of uneven length, which explains why this reshape would fail.

Would you be able to help me (or just point me to a code reference) so that I can chunk the documents into 2049-token chunks? For context, I only want to run evaluations on a subset of the training data. I want the chunks to be constructed exactly the same way as during training, so that I can put them in a dataloader and simply subsample on top (perhaps with something like torch.utils.data.Subset).
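For what it's worth, a naive way to cut a flat token stream into fixed-length rows is just a truncating reshape; the `chunk_tokens` helper below is a hypothetical sketch of that idea, and it will *not* reproduce the training order or shuffle, only the 2049-token shape:

```python
import numpy as np

def chunk_tokens(tokens, seq_len=2049):
    """Split a flat 1-D token array into fixed-length rows,
    dropping any trailing remainder that does not fill a full chunk."""
    n_full = len(tokens) // seq_len
    return tokens[: n_full * seq_len].reshape(n_full, seq_len)

# toy example: 10 tokens with seq_len=4 -> two rows, last 2 tokens dropped
chunks = chunk_tokens(np.arange(10), seq_len=4)
print(chunks.shape)  # (2, 4)
```

Getting the chunks to match training exactly is the hard part, since it depends on the document order and boundary handling used when the batches were built.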

activatedgeek commented 2 months ago

It looks like the right dataset to use here is EleutherAI/pile-deduped-pythia-preshuffled, which yields a uniform length of 2049 tokens across all samples.
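With uniform sample lengths, the merged token file can be memory-mapped and viewed as one sample per row. The sketch below assumes a flat binary file of uint16 token IDs whose length is a multiple of 2049 (the file path and the `load_preshuffled` helper are illustrative, not from the repo); the demo uses a tiny synthetic file:

```python
import os
import tempfile

import numpy as np

def load_preshuffled(path, seq_len=2049, dtype=np.uint16):
    """Memory-map a flat token file whose length is a whole number of
    seq_len-sized samples, and view it as one sample per row."""
    data = np.memmap(path, dtype=dtype, mode="r")
    assert len(data) % seq_len == 0, "file is not a whole number of samples"
    return data.reshape(-1, seq_len)

# demo on a tiny synthetic file: 4 samples of seq_len 8
tmp = tempfile.NamedTemporaryFile(suffix=".bin", delete=False)
np.arange(4 * 8, dtype=np.uint16).tofile(tmp.name)
samples = load_preshuffled(tmp.name, seq_len=8)
print(samples.shape)  # (4, 8)
os.unlink(tmp.name)
```

Each row can then be wrapped in a torch Dataset and subsampled with torch.utils.data.Subset, as suggested above.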