chrisociepa / allamo

Simple, hackable and fast implementation for training/finetuning medium-sized LLaMA-based models
MIT License

Exception: Training dataset files not found! #17

Open sankexin opened 3 hours ago

sankexin commented 3 hours ago

https://github.com/karpathy/nanoGPT/blob/master/data/shakespeare/prepare.py

../data/allamo_1B_dataset/

input.txt  
train.bin
val.bin

The idea for this project is great, thank you. Is it correct to use the data in this way? I would like to further improve the project by providing a pretrained (not finetuned) open-source Llama 3, and I am also considering contributing the from-scratch Llama 3 training code back to the open-source community.
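For context, here is roughly what the linked nanoGPT prepare.py does to produce train.bin/val.bin from input.txt. This is only a sketch: the GPT-2 tokenizer is nanoGPT's choice and the dataset path is the one from this issue, not necessarily what an allamo run should use.

```python
import os
import numpy as np
import tiktoken

data_dir = "../data/allamo_1B_dataset"  # path from this issue, adjust as needed
with open(os.path.join(data_dir, "input.txt"), "r", encoding="utf-8") as f:
    data = f.read()

# 90/10 train/val split, as in nanoGPT's prepare.py
n = len(data)
train_data = data[: int(n * 0.9)]
val_data = data[int(n * 0.9):]

# nanoGPT uses the GPT-2 BPE tokenizer; swap in your own tokenizer here
enc = tiktoken.get_encoding("gpt2")
train_ids = np.array(enc.encode_ordinary(train_data), dtype=np.uint16)
val_ids = np.array(enc.encode_ordinary(val_data), dtype=np.uint16)

# flat binary token streams, memmap-friendly
train_ids.tofile(os.path.join(data_dir, "train.bin"))
val_ids.tofile(os.path.join(data_dir, "val.bin"))
```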

chrisociepa commented 3 hours ago

Yes, that's correct. There's also an alternative supported dataset format where, instead of a numpy array, you use a tensor. In this case, you concatenate all tensors for the documents just as you would with numpy, but in the end, you save the resulting tensor with torch.save(final_tensor, "train.pt").
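A minimal sketch of that tensor variant, following the description above. The document list and the encode helper are placeholders; replace them with your corpus and real tokenizer.

```python
import torch

docs = ["first document ...", "second document ..."]  # placeholder corpus

def encode(text):
    # placeholder tokenizer: replace with your real one
    return torch.tensor([ord(c) for c in text], dtype=torch.long)

# concatenate the per-document tensors into one long token stream,
# just as you would concatenate numpy arrays for the .bin format
final_tensor = torch.cat([encode(doc) for doc in docs])

# save the result; the loader then reads train.pt instead of train.bin
torch.save(final_tensor, "train.pt")
```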