LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

GradientAI Auto ROPE Base calculation #910

Closed · askmyteapot closed this 2 weeks ago

askmyteapot commented 3 weeks ago

https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models has a formula that is a better fit for the ideal rope scaling.

Tested with Llama3, and checked that the calculation is correct for Llama2. Retains the existing logic of not scaling rope when below the trained CTX.
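
For clarity, here's the calculation in miniature (a Python sketch with illustrative names, not the exact patch; it reproduces the Gradient rows in the benchmark table below):

```python
import math

def gradient_rope_base(base: float, ctx_train: int, ctx_target: int) -> float:
    """Gradient AI formula: map each context length to chi = ctx / (2*pi),
    then raise the trained base to the ratio of their logarithms."""
    if ctx_target <= ctx_train:
        return base  # retain existing behaviour: no scaling under trained CTX
    chi_train = ctx_train / (2 * math.pi)
    chi_target = ctx_target / (2 * math.pi)
    return base ** (math.log(chi_target) / math.log(chi_train))

print(gradient_rope_base(10000, 4096, 16384))   # ~71738 (L2-13B row)
print(gradient_rope_base(500000, 8192, 16384))  # ~1776948 (L3-15B row)
```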

askmyteapot commented 3 weeks ago

Have done some benchmarks using perplexity. Results are as follows:

| Model | Base Rope | Tested Rope | Method | CTX_train | CTX_test | PPL Result |
| --- | --- | --- | --- | --- | --- | --- |
| L3-15B-q8 | 500000 | 1638400.0 | Kobo | 8192 | 16384 | 6.3957 +/- 0.04083 |
| L3-15B-q8 | 500000 | 1776948.1 | Gradient | 8192 | 16384 | 6.0832 +/- 0.03830 |
| L3-15B-q8 | 500000 | 1843000.0 | Manual | 8192 | 16384 | 6.1221 +/- 0.03865 |
| L2-13B-q4 | 10000 | 65536 | Kobo | 4096 | 16384 | 6.8271 +/- 0.04360 |
| L2-13B-q4 | 10000 | 71738 | Gradient | 4096 | 16384 | 6.9586 +/- 0.04421 |
| L2-13B-q4 | 10000 | 49152 | Kobo | 4096 | 12288 | 6.0357 +/- 0.03804 |
| L2-13B-q4 | 10000 | 47661 | Gradient | 4096 | 12288 | 6.0041 +/- 0.03785 |
| L2-13B-q4 | 10000 | 32768 | Kobo | 4096 | 8192 | 6.0434 +/- 0.03913 |
| L2-13B-q4 | 10000 | 26784 | Gradient | 4096 | 8192 | 5.9039 +/- 0.03831 |

For Llama3, it's definitely a better fit. For Llama2, it's a better fit when doubling and tripling context, but worse when quadrupling. (However, 4x context on Llama2 takes 12GB just for the KV cache, so I doubt most people will use it.)

Hope that helps.

EDIT: I'm an idiot and used my manual tuning result for Llama3 instead of the formula result. The table has been updated now.

askmyteapot commented 2 weeks ago

Changing to draft. Discovered some scaling issues with Solar models (Mistral 7B v0.1 with a sliding window).

Discovered that SWA models require the CTX figures to be multiplied by 8 to get close to a suitable rope base.
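
In terms of the formula sketched above, that just means scaling both context figures by 8 before the chi calculation (the 8x factor is what my testing suggests, not something from the blog post):

```python
def gradient_rope_base_swa(base, ctx_train, ctx_target, swa_multiplier=8):
    # SWA models: multiplying both CTX figures by 8 pushes the log ratio
    # toward 1, which yields a smaller (and in testing, more suitable) base.
    return gradient_rope_base(base, ctx_train * swa_multiplier,
                              ctx_target * swa_multiplier)
```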

askmyteapot commented 2 weeks ago

OK, I managed to get a workable solution for Solar-based models (like Fimb). I had to use the total tensor count of 435, combined with a base of 10000, to identify them.

I haven't figured out a decent way to identify original Mistral 7B v0.1 models, but in theory they should use the previous logic, as the model 'thinks' it has a context of 32k. If GGUF had metadata for "sliding window", then it would be easy.
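
For reference, the identification heuristic amounts to this (sketch; the 435 tensor count and 10000 base are the values from my testing above):

```python
def looks_like_solar(total_tensor_count: int, trained_rope_base: float) -> bool:
    # Heuristic: Solar-based models ship 435 tensors and train with a rope
    # base of 10000. There is no GGUF "sliding window" key to check instead.
    return total_tensor_count == 435 and trained_rope_base == 10000.0

def auto_rope_base(base, ctx_train, ctx_target, total_tensor_count):
    mult = 8 if looks_like_solar(total_tensor_count, base) else 1
    return gradient_rope_base(base, ctx_train * mult, ctx_target * mult)
```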

LostRuins commented 2 weeks ago

Is this PR ready for review yet, or still in development?