Closed WilliamsToTo closed 1 month ago
@WilliamsToTo generally there are a lot of small differences: the data, the batch size, the addition of chat templates for SFT, the learning rate schedulers. But you can do both in most repos. For example, we've reproduced some open-instruct results with the olmo repository.
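To make the chat-template point concrete, here is a minimal sketch (not the actual open-instruct or olmo code; token ids and the template prefix are made up for illustration) of the main label-construction difference: both objectives use next-token cross-entropy, but SFT typically masks the prompt tokens out of the loss, while pre-training treats every token as a target.

```python
# Label value that cross-entropy implementations (e.g. PyTorch's
# ignore_index=-100 convention) skip when computing the loss.
IGNORE_INDEX = -100

def build_sft_labels(prompt_ids, response_ids):
    """SFT: concatenate prompt + response; only response tokens get a loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

def build_pretraining_labels(token_ids):
    """Pre-training: plain language modeling, every token is a target."""
    return list(token_ids), list(token_ids)

# Hypothetical tokenized example: the prompt would come from applying a
# chat template (e.g. "<|user|> ... <|assistant|>") before tokenization.
prompt = [101, 7592]
response = [2023, 2003, 102]

sft_inputs, sft_labels = build_sft_labels(prompt, response)
pt_inputs, pt_labels = build_pretraining_labels(prompt + response)
# sft_labels masks the two prompt positions; pt_labels keeps all five.
```

Everything else in the loop (optimizer, forward pass, cross-entropy) can stay the same, which is why the same codebase can usually do both.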
I'm going to close the issue, as it's not really related to the codebase, but feel free to reopen if you run into an issue with the code.
As I understand it, the primary difference between pre-training and supervised fine-tuning (SFT) lies in the dataset used. Pre-training is conducted on a plain text corpus, whereas SFT utilizes a dataset in a specific format, such as TULU. Are there any differences in the training scripts, loss functions, or hyperparameters between pre-training and SFT?