Hi,

I was wondering if you could provide the index_mapping files generated by GPT2Dataset. From the construction of GPT2Dataset here, I can see that there are three .npy index files. Could you provide a copy of these files so that I don't need to regenerate them?
I'm asking because I want to study the influence of the original training data, chunk by chunk. I have prepared the pythia-dedup dataset, but I failed to build the environment. After reading the GPT2Dataset code, I found that with these index files I can reproduce the original training data of Pythia.
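For context, this is the mapping I have in mind, as a minimal sketch based on my reading of GPT2Dataset. I'm assuming the three .npy files are the Megatron-style `doc_idx`, `sample_idx`, and `shuffle_idx` arrays; `reconstruct_sample` and `get_doc_tokens` are hypothetical names standing in for the indexed-dataset lookup, and the tiny arrays below are synthetic:

```python
import numpy as np

def reconstruct_sample(sample, doc_idx, sample_idx, shuffle_idx, get_doc_tokens):
    """Map a shuffled sample index back to its token sequence using the three
    Megatron-style index arrays. get_doc_tokens(doc) is a stand-in for the
    indexed dataset's document lookup (returns that document's token array)."""
    idx = shuffle_idx[sample]                 # undo the sample-level shuffle
    doc_f, off_f = sample_idx[idx]            # first document and start offset
    doc_l, off_l = sample_idx[idx + 1]        # last document and end offset (inclusive)
    if doc_f == doc_l:
        return get_doc_tokens(doc_idx[doc_f])[off_f : off_l + 1]
    pieces = [get_doc_tokens(doc_idx[doc_f])[off_f:]]           # tail of first doc
    pieces += [get_doc_tokens(doc_idx[d]) for d in range(doc_f + 1, doc_l)]
    pieces.append(get_doc_tokens(doc_idx[doc_l])[: off_l + 1])  # head of last doc
    return np.concatenate(pieces)

# Tiny synthetic example: two documents, samples of seq_len + 1 = 3 tokens.
docs = {0: np.arange(5), 1: np.arange(10, 14)}
doc_idx = np.array([0, 1])                        # document-level shuffle (identity here)
sample_idx = np.array([[0, 0], [0, 2], [0, 4], [1, 1], [1, 3]])
shuffle_idx = np.array([2, 0, 3, 1])              # sample-level shuffle

print(reconstruct_sample(0, doc_idx, sample_idx, shuffle_idx, docs.__getitem__))
# -> [ 4 10 11]  (sample 0 maps to raw sample 2, which spans both documents)
```

With the real files, I would load the arrays via `np.load(..., mmap_mode="r")` and point `get_doc_tokens` at the pythia-dedup indexed dataset; if my reading of the index layout is wrong, please correct me.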
I noticed that you provide batch_viewer.py to check the unshuffled data, but that data still seems to differ from the actual training data fed to the model during training.

Thanks