FranxYao / Long-Context-Data-Engineering

Implementation of paper Data Engineering for Scaling Language Models to 128K Context

Small correction for YaRN-Mistral model #2

Open bloc97 opened 9 months ago

bloc97 commented 9 months ago

Hello! Author of YaRN here. First of all, thank you for this very comprehensive paper on data engineering challenges for long-context LLMs. It will certainly be very useful for the research community in the quest to train better and more robust long-context models!

However, there's been a small confusion about how the YaRN Mistral 7B 128K model was trained (Fig. 1 of the paper): this model was trained on a 16k-context-length dataset without length upsampling (the dataset used is a derivative of what TogetherAI used to train their 32k model, but chunked to 16k instead). The Llama 2 7B 128K model is the one that was trained on PG19, chunked to a context of 64k (not 128k), which I think would be a more appropriate comparison; there are simply too many confounding variables with our Mistral YaRN models.
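(A minimal sketch of what plain fixed-length chunking without length upsampling looks like, for readers less familiar with the setup; the function and variable names below are hypothetical and not taken from either codebase.)

```python
from typing import Iterable, Iterator


def chunk_corpus(token_docs: Iterable[list[int]], chunk_len: int = 16_384) -> Iterator[list[int]]:
    """Concatenate tokenized documents and cut fixed-size training chunks.

    Plain chunking like this keeps no notion of document length, which is why
    there is no "length upsampling": long documents are simply split apart.
    """
    buffer: list[int] = []
    for doc in token_docs:
        buffer.extend(doc)
        while len(buffer) >= chunk_len:
            yield buffer[:chunk_len]
            buffer = buffer[chunk_len:]
```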

Also, the reason we were able to get away with training with such a small context (16k) is that YaRN exhibits the behaviour necessary for context-length extrapolation even without finetuning (albeit not very well, and only for small extension scale ratios).

Unfortunately, the passkey evaluation that we used was much easier than the Needle-in-a-Haystack test (which didn't exist back then), so we originally did not notice any degradation of long-context capabilities when shortening the dataset from 128k to 64k and then to 16k (cheaper to train). With the newer Needle-in-a-Haystack tests, the degradation is apparent. We will certainly be trying out the new methods outlined in this paper for future finetunes!

FranxYao commented 9 months ago

Thank you for the detailed explanation! YaRN is definitely great work and I drew a lot of inspiration from it! I also totally understand the timeline: Needle-in-a-Haystack is quite recent (and passkey retrieval may not be as informative, but it was the best eval available at the time). I also believe that, given the length-upsampled data, YaRN-Mistral would perform just as well or better. I'll update the paper to incorporate this information. Could you also mention this in your paper / on your GitHub so I can refer to it?

Additionally, I wonder what the differences are between the 64K and 128K YaRN-Mistral models. Are they both finetuned on 16K, but one extrapolates to 64K and the other to 128K? And what about YaRN-LLaMA 64K / 128K?

Thanks!

bloc97 commented 9 months ago

> Additionally, I wonder what the differences are between the 64K and 128K YaRN-Mistral models. Are they both finetuned on 16K, but one extrapolates to 64K and the other to 128K? And what about YaRN-LLaMA 64K / 128K?

The Llama YaRN models were trained with 64k data, but with a higher YaRN scaling factor (similar to a higher base in ABF) such that the final 128k model is able to extrapolate from 64k data to a 128k context size. Then we noticed that we could have gotten away with training on only 16k data (which we now know is not optimal), and that's what we did for the Mistral 64k and 128k models. The same phenomenon is observed in your paper, where you train on 80k data and it extrapolates to 128k.
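(To make the scale-factor point concrete, here is a small back-of-the-envelope sketch; all numbers are illustrative assumptions, not the released configurations.)

```python
# Illustrative arithmetic only; the real configs may differ.
ORIGINAL_CTX = 4_096     # Llama 2 pretraining context
TARGET_CTX = 131_072     # desired 128k context
TRAIN_CTX = 65_536       # finetuning data is only 64k long

# The YaRN scale factor is chosen for the *target* context, not the training
# length, so a model finetuned on 64k data can still be run at 128k.
yarn_scale = TARGET_CTX / ORIGINAL_CTX   # = 32.0

# ABF's analogue is raising the RoPE base instead of applying a scale factor
# (the value below is an assumption for illustration, not a quoted config).
abf_base = 500_000.0     # vs. the original base of 10_000
```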

Non-linearity in RoPE interpolation is definitely the key to unlocking extrapolation capabilities (train short, test long) for RoPE models; we just have to find the best one.
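(A simplified sketch of the non-linear, NTK-by-parts-style frequency schedule that YaRN uses, contrasted with linear position interpolation; it omits YaRN's attention-temperature term and uses commonly cited default constants, so treat it as an approximation rather than the released implementation.)

```python
import math
import numpy as np


def yarn_like_inv_freq(dim: int, scale: float, original_ctx: int = 4096,
                       base: float = 10_000.0, alpha: float = 1.0,
                       beta: float = 32.0) -> np.ndarray:
    """Approximate NTK-by-parts RoPE frequency schedule (YaRN-style).

    Linear position interpolation would divide *every* frequency by `scale`;
    here only the low-frequency dimensions (long wavelengths) are interpolated,
    while the high-frequency dimensions that encode local offsets are left
    mostly untouched -- the non-linearity referred to above.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)             # RoPE frequencies theta_d
    wavelength = 2 * math.pi / inv_freq                          # tokens per full rotation
    ratio = original_ctx / wavelength                            # rotations within the original context
    ramp = np.clip((ratio - alpha) / (beta - alpha), 0.0, 1.0)   # 0 = interpolate fully, 1 = keep as-is
    return (1 - ramp) * inv_freq / scale + ramp * inv_freq


# Usage sketch: frequencies for a 128-dim head extended by a factor of 32.
# Linear PI, for comparison, would simply be inv_freq / 32 on every dimension.
freqs_128k = yarn_like_inv_freq(dim=128, scale=32.0)
```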