Closed dinosaurtirex closed 2 years ago
Hey @sneakybeaky18,
The list file is simply a text file containing the paths to your data files, one path per line. The data should be split into equal shards (one per GPU) in order to run distributed training. For example:
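```python
num_gpus = 8  # example value: set this to the number of GPUs you train on

# Read all samples, dropping empty lines (e.g. the one left by a trailing newline).
with open("/home/jovyan/data/all_data.txt", "r") as f:
    data = [line for line in f.read().split("\n") if line]

# Ceiling division, so the data fits into at most num_gpus shards
# (floor division would leave an extra, smaller shard for the remainder).
batch_size = max(1, -(-len(data) // num_gpus))

with open("/home/jovyan/data/quests/final/train.list", "w") as file:
    for idx, start in enumerate(range(0, len(data), batch_size)):
        shard_path = f"/home/jovyan/data/train{idx}.txt"
        # Write one shard of the data...
        with open(shard_path, "w") as file_t:
            for line in data[start:start + batch_size]:
                file_t.write(f"{line}\n")
        # ...and record its path in the .list file.
        file.write(f"{shard_path}\n")
```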
After that, you can use the list file as shown in the example [Colab](https://colab.research.google.com/github/ai-forever/ru-gpts/blob/master/examples/ruGPT3XL_finetune_example.ipynb).
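If you want to sanity-check the list file before starting training, something like the following may help. It is only a minimal sketch: `check_list_file` is a hypothetical helper name, not part of ru-gpts, and it only assumes the format described above (one shard path per line).

```python
from pathlib import Path


def check_list_file(list_path):
    """Return {shard_path: line_count} for every shard named in a .list file.

    Hypothetical helper, not part of ru-gpts. Assumes one shard path per line.
    """
    counts = {}
    for entry in Path(list_path).read_text().splitlines():
        entry = entry.strip()
        if not entry:
            continue  # ignore blank lines
        shard = Path(entry)
        if not shard.is_file():
            raise FileNotFoundError(f"Shard listed in {list_path} not found: {entry}")
        # Count the lines (samples) stored in this shard.
        with shard.open() as f:
            counts[entry] = sum(1 for _ in f)
    return counts
```

Calling `check_list_file("/home/jovyan/data/quests/final/train.list")` should then give you one entry per shard, so you can see at a glance whether the shards are roughly equal in size.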
Thanks to your team for a great open-source project, and thanks for the answer.
Hello! Can you please explain how I should create the .list file? I don't understand that part.