CStanKonrad / long_llama

LongLLaMA is a large language model capable of handling long contexts. It is based on OpenLLaMA and fine-tuned with the Focused Transformer (FoT) method.
Apache License 2.0

How would you go about instruction finetuning? #2

Open jordancole21 opened 1 year ago

jordancole21 commented 1 year ago

How would you finetune in this style with an instruction finetuning dataset like Open-Orca?

syzymon commented 1 year ago

Hi, thanks for an excellent question and suggestion of the dataset! We are planning to provide an example of fine-tuning our models using the huggingface API, which would include instruction tuning.
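
Until that example lands, here is a minimal sketch of what instruction SFT through the Hugging Face API might look like. The checkpoint name (`syzymon/long_llama_3b`), the `Open-Orca/OpenOrca` dataset path and its field names, and all hyperparameters are assumptions for illustration, not the repo's official recipe.

```python
# Minimal SFT sketch (illustrative only). Assumed names: "syzymon/long_llama_3b"
# checkpoint, "Open-Orca/OpenOrca" dataset with system_prompt/question/response fields.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          LlamaTokenizer, Trainer, TrainingArguments)

checkpoint = "syzymon/long_llama_3b"  # assumed Hugging Face checkpoint name
tokenizer = LlamaTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Small slice for a smoke test; use the full split for a real run.
raw = load_dataset("Open-Orca/OpenOrca", split="train[:1%]")

def to_text(example):
    # Collapse one OpenOrca row into a single prompt + response string.
    return {"text": f"{example['system_prompt']}\n\n{example['question']}\n\n{example['response']}"}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=2048)

ds = raw.map(to_text).map(tokenize, remove_columns=raw.column_names + ["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="longllama-orca-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=ds,
    # Causal-LM collator: labels are just the (padded) input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Memory savers such as gradient checkpointing or LoRA are left out to keep the sketch short; a real 3B run on a single A100 would likely need them.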

soacker commented 1 year ago

How long will it take before we can see your fine-tuning code? Thanks!

rivaldophilip commented 1 year ago

Following on this as well! Would love to try an Instruct-LongLlama

syzymon commented 11 months ago

We are planning to release instruction tuning code in pytorch & checkpoints & examples early next week. Stay tuned!

jordancole21 commented 11 months ago

> We are planning to release instruction tuning code in pytorch & checkpoints & examples early next week. Stay tuned!

Thank you! Super excited now!

syzymon commented 11 months ago

In case you haven't seen, the instruction code is already there! See https://twitter.com/s_tworkowski/status/1687620785379360768 and the READMEs in this repo for more details.

jordancole21 commented 11 months ago

> In case you haven't seen, the instruction code is already there! See https://twitter.com/s_tworkowski/status/1687620785379360768 and the READMEs in this repo for more details.

Hey, thanks again for sharing this! Just curious, would this training also work for much smaller models like Pythia 160M, or does it only work for the LongLLaMA models y'all released at the moment?

syzymon commented 11 months ago

In terms of instruction finetuning, I personally don't think it makes much sense to do SFT on models below 3B - I mean, 3B models are not very capable, and the thing you want to achieve with SFT is to actually have a useful model to interact with. Our code is technically just a bunch of optimizations to fit a 3B model into an A100, so it's not very useful for a 160M model, I presume.

But for FoT pretraining, if you want to test hypotheses by measuring perplexity, you're encouraged to look at our paper https://arxiv.org/abs/2307.03170, in particular Section 5, where all experiments are done at a ~150M parameter scale. We initially developed this method at that scale and then scaled it to billion-parameter models, so it should definitely work. There is currently no pretraining codebase available, though; we are planning to open source one in JAX after the LongLLaMA v2 release, but there are no plans to write a pytorch codebase on our side.

jordancole21 commented 11 months ago

> In terms of instruction finetuning, I personally don't think it makes much sense to do SFT on models below 3B - I mean, 3B models are not very capable, and the thing you want to achieve with SFT is to actually have a useful model to interact with. Our code is technically just a bunch of optimizations to fit a 3B model into an A100, so it's not very useful for a 160M model, I presume.
>
> But for FoT pretraining, if you want to test hypotheses by measuring perplexity, you're encouraged to look at our paper https://arxiv.org/abs/2307.03170, in particular Section 5, where all experiments are done at a ~150M parameter scale. We initially developed this method at that scale and then scaled it to billion-parameter models, so it should definitely work. There is currently no pretraining codebase available, though; we are planning to open source one in JAX after the LongLLaMA v2 release, but there are no plans to write a pytorch codebase on our side.

Yeah, right now I'm testing a hypothesis similar to the TinyStories and Textbooks Are All You Need papers, where the goal is to see how far we can push small models by using more data and training them for longer than you normally would. At the moment I have a dataset of 12 million reasoning-style instructions, and I'm planning on training a few models on it for up to 20 epochs.

But I'm curious: is there a way to make an off-the-shelf model work with FoT, or does it have to be pretrained the way y'all set it up and then finetuned further to follow instructions?

Thanks again for the help!

syzymon commented 11 months ago

I see, I see - I think now I understand your question. There is no requirement for a model to be pre-trained with FoT. Both in the paper and in LongLLaMA we take models trained in a vanilla way and then fine-tune them with FoT (note that we do not perform instruction tuning, just continued pretraining on a large amount of generic, non-instruction data, the way a base model is trained). My doubt is actually about what you mean by FoT here; it has 2 components: training on negative examples, and the architecture. I don't think training on negative examples makes much sense in the context of instruction tuning, but I haven't tried that - it just seems unreasonable. Architecturally, there is no problem in fine-tuning an existing model like Pythia with the FoT architecture (basically having a subset of layers responsible for handling long-range dependencies and disabling positionals in them).
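
For intuition only, below is a toy sketch of that architectural idea: a designated memory layer whose attention also sees keys and values cached from earlier chunks, with no positional information attached to the cached entries. The class name, single-head attention, and omission of masking are simplifications, not the repo's implementation.

```python
# Toy illustration (not the repo's code): one attention layer that also attends over
# (key, value) pairs cached from earlier chunks; cached entries carry no positionals.
import torch
import torch.nn.functional as F

class ToyMemoryAttention(torch.nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = torch.nn.Linear(dim, dim)
        self.k_proj = torch.nn.Linear(dim, dim)
        self.v_proj = torch.nn.Linear(dim, dim)
        self.mem_keys: list[torch.Tensor] = []    # keys from earlier chunks
        self.mem_values: list[torch.Tensor] = []  # values from earlier chunks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, local_len, dim). Positional encoding and causal masking are
        # omitted for brevity; the point is only that memory entries are position-free.
        q = self.q_proj(x)
        local_k, local_v = self.k_proj(x), self.v_proj(x)
        k = torch.cat(self.mem_keys + [local_k], dim=1)
        v = torch.cat(self.mem_values + [local_v], dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        # Cache this chunk's keys/values so later chunks can attend to them.
        self.mem_keys.append(local_k.detach())
        self.mem_values.append(local_v.detach())
        return out

layer = ToyMemoryAttention(dim=64)
_ = layer(torch.randn(1, 128, 64))    # first chunk: local attention only
out = layer(torch.randn(1, 128, 64))  # second chunk also sees chunk 1's cached keys/values
```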

jordancole21 commented 11 months ago

Oh, I think I understand! Sorry for the confusion. So in order to get an off-the-shelf model to work like the LongLLaMA models y'all released, I need to pretrain it for longer using y'all's pretraining code, and then, if I want it to follow instructions over long context, that's where y'all's latest finetuning code would come in?

If that's correct, do y'all have any plans to release either the pretraining code and data or some of the smaller models you worked with?

Thanks again!

jordancole21 commented 11 months ago

Also would y'all be willing to get on a zoom call or a Google meet to discuss it a little bit more? If not that's totally cool!

syzymon commented 11 months ago

> So in order to get an off-the-shelf model to work like the LongLLaMA models y'all released, I need to pretrain it for longer using y'all's pretraining code, and then, if I want it to follow instructions over long context, that's where y'all's latest finetuning code would come in?

That's it, basically. If you want to achieve long-context capability just by instruction finetuning, honestly you'd be better off trying something like LongChat. But their training data is not public either, and you definitely need long-context instruction tuning data, which isn't available. That's where our bet comes from: continued long-context pretraining with the FoT objective, and then instruction fine-tuning on short context to follow instructions.

By the way, it is an interesting hypothesis to test whether, similarly to generalizing to multiple languages after instruction tuning in pure English, LLMs could generalize to longer instructions without seeing them at train time. I think it just requires some playing with inference on our current instruct model to actually test that hypothesis - not exactly sure how to formulate it rigorously.
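
One low-effort way to start probing that hypothesis is plain Hugging Face generation with a deliberately long prompt, as sketched below. The instruct checkpoint name (`syzymon/long_llama_3b_instruct`), the prompt layout, and the input file are assumptions; the repo's README documents the actual names and any FoT-specific generation kwargs.

```python
# Rough probe of long-instruction generalization via plain generation (illustrative).
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

checkpoint = "syzymon/long_llama_3b_instruct"  # assumed instruct checkpoint name
tokenizer = LlamaTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float32, trust_remote_code=True
)

# A deliberately long instruction: a large document followed by a question,
# much longer than anything seen during short-context instruction tuning.
long_document = open("some_long_report.txt").read()  # placeholder input file
prompt = f"{long_document}\n\nQuestion: Summarize the key findings above.\nAnswer:"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids=input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```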

For the release of the pretraining code, see this Twitter thread: https://twitter.com/4evaBehindSOTA/status/1687757769120862209?s=20. Note that our current code is in JAX, and it would require substantial engineering work on your side to understand it and rewrite it into pytorch, if you wanted to. If you are willing to undertake this work (which would probably be quite valuable for the community if you then published it), please let me know and we could set up a call or something.