sankexin opened this issue 3 hours ago
Yes, that's correct. There's also an alternative supported dataset format where, instead of a numpy array, you use a tensor. In this case, you concatenate all the tensors for the documents just as you would with numpy, but at the end you save the resulting tensor with torch.save(final_tensor, "train.pt").
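In case it helps, here is a minimal sketch of that tensor-based flow. The placeholder document list, the use of tiktoken's GPT-2 encoding as the tokenizer, and the dtype are assumptions for illustration; only the concatenation step and torch.save(final_tensor, "train.pt") come from the description above.

```python
# Minimal sketch of the tensor-based dataset format (assumptions noted above).
import tiktoken
import torch

# Assumption: tiktoken's GPT-2 encoding stands in for the project's tokenizer.
enc = tiktoken.get_encoding("gpt2")

# Assumption: placeholder corpus; in practice these would be your documents.
documents = ["First document text...", "Second document text..."]

# Tokenize each document into a 1-D tensor of token IDs.
token_tensors = [
    torch.tensor(enc.encode_ordinary(doc), dtype=torch.long)
    for doc in documents
]

# Concatenate all documents into one flat tensor, mirroring the numpy flow,
# then save the result as train.pt.
final_tensor = torch.cat(token_tensors)
torch.save(final_tensor, "train.pt")
```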
https://github.com/karpathy/nanoGPT/blob/master/data/shakespeare/prepare.py
../data/allamo_1B_dataset/
The idea behind this project is great, thank you. Is it correct to use the data in this way? I would like to further improve the project by providing a pretrained (not fine-tuned) open-source Llama3. I am also considering contributing the Llama3 code for training from scratch back to the open-source community.