LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

GradientAI Auto ROPE Base calculation #910

Closed · askmyteapot closed this 2 weeks ago

askmyteapot commented 3 weeks ago

https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models has a formula that is a better fit for the ideal rope scaling.

Tested with Llama3, and checked that the calculation is correct for Llama2. Retains the existing logic of not scaling rope when below the trained CTX.
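
For clarity, here's the calculation in miniature (a Python sketch with illustrative names, not the exact patch; it reproduces the Gradient rows in the benchmark table below):

```python
import math

def gradient_rope_base(base: float, ctx_train: int, ctx_target: int) -> float:
    """Gradient AI formula: map each context length to chi = ctx / (2*pi),
    then raise the trained base to the ratio of their logarithms."""
    if ctx_target <= ctx_train:
        return base  # retain existing behaviour: no scaling under trained CTX
    chi_train = ctx_train / (2 * math.pi)
    chi_target = ctx_target / (2 * math.pi)
    return base ** (math.log(chi_target) / math.log(chi_train))

print(gradient_rope_base(10000, 4096, 16384))   # ~71738 (L2-13B row)
print(gradient_rope_base(500000, 8192, 16384))  # ~1776948 (L3-15B row)
```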

askmyteapot commented 3 weeks ago

Have done some benchmarks using perplexity. Results are as follows:

| Model | Base Rope | Tested Rope | Method | CTX_train | CTX_test | PPL Result |
| --- | --- | --- | --- | --- | --- | --- |
| L3-15B-q8 | 500000 | 1638400.0 | Kobo | 8192 | 16384 | 6.3957 +/- 0.04083 |
| L3-15B-q8 | 500000 | 1776948.1 | Gradient | 8192 | 16384 | 6.0832 +/- 0.03830 |
| L3-15B-q8 | 500000 | 1843000.0 | Manual | 8192 | 16384 | 6.1221 +/- 0.03865 |
| L2-13B-q4 | 10000 | 65536 | Kobo | 4096 | 16384 | 6.8271 +/- 0.04360 |
| L2-13B-q4 | 10000 | 71738 | Gradient | 4096 | 16384 | 6.9586 +/- 0.04421 |
| L2-13B-q4 | 10000 | 49152 | Kobo | 4096 | 12288 | 6.0357 +/- 0.03804 |
| L2-13B-q4 | 10000 | 47661 | Gradient | 4096 | 12288 | 6.0041 +/- 0.03785 |
| L2-13B-q4 | 10000 | 32768 | Kobo | 4096 | 8192 | 6.0434 +/- 0.03913 |
| L2-13B-q4 | 10000 | 26784 | Gradient | 4096 | 8192 | 5.9039 +/- 0.03831 |

For Llama3, it's definitely a better fit. For Llama2, it's a better fit when doubling and tripling context, but worse when quadrupling. (However, 4x context on Llama2 takes 12GB just for the KV cache, so I doubt most people will use it.)

Hope that helps.

EDIT: I'm an idiot and used my manual tuning result for Llama3 instead of the formula result. The table has been updated now.

askmyteapot commented 2 weeks ago

Changing to draft. Discovered some scaling issues with Solar models (Mistral 7B v0.1 with a sliding window).

Discovered that SWA models require the CTX figures to be multiplied by 8 to get close to a suitable rope base.
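
In terms of the formula sketched above, that just means scaling both context figures by 8 before the chi calculation (the 8x factor is what my testing suggests, not something from the blog post):

```python
def gradient_rope_base_swa(base, ctx_train, ctx_target, swa_multiplier=8):
    # SWA models: multiplying both CTX figures by 8 pushes the log ratio
    # toward 1, which yields a smaller (and in testing, more suitable) base.
    return gradient_rope_base(base, ctx_train * swa_multiplier,
                              ctx_target * swa_multiplier)
```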

askmyteapot commented 2 weeks ago

OK, I managed to get a workable solution for Solar-based models (like Fimb). I had to use the total tensor count of 435, combined with a base of 10000, to identify them.

I haven't figured out a decent way to identify original Mistral 7B v0.1 models, but in theory they should use the previous logic, as the model 'thinks' it has a context of 32k. If GGUF had metadata for "sliding window", then it would be easy.
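
For reference, the identification heuristic amounts to this (sketch; the 435 tensor count and 10000 base are the values from my testing above):

```python
def looks_like_solar(total_tensor_count: int, trained_rope_base: float) -> bool:
    # Heuristic: Solar-based models ship 435 tensors and train with a rope
    # base of 10000. There is no GGUF "sliding window" key to check instead.
    return total_tensor_count == 435 and trained_rope_base == 10000.0

def auto_rope_base(base, ctx_train, ctx_target, total_tensor_count):
    mult = 8 if looks_like_solar(total_tensor_count, base) else 1
    return gradient_rope_base(base, ctx_train * mult, ctx_target * mult)
```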

LostRuins commented 2 weeks ago

Is this PR ready for review yet, or still in development?