I've done some benchmarks using perplexity.
Results are as follows:
Model | Base Rope | Tested Rope | Method | CTX_train | CTX_test | PPL Result |
---|---|---|---|---|---|---|
L3-15B-q8 | 500000 | 1638400.0 | Kobo | 8192 | 16384 | 6.3957 +/- 0.04083 |
L3-15B-q8 | 500000 | 1776948.1 | Gradient | 8192 | 16384 | 6.0832 +/- 0.03830 |
L3-15B-q8 | 500000 | 1843000.0 | Manual | 8192 | 16384 | 6.1221 +/- 0.03865 |
L2-13B-q4 | 10000 | 65536 | Kobo | 4096 | 16384 | 6.8271 +/- 0.04360 |
L2-13B-q4 | 10000 | 71738 | Gradient | 4096 | 16384 | 6.9586 +/- 0.04421 |
L2-13B-q4 | 10000 | 49152 | Kobo | 4096 | 12288 | 6.0357 +/- 0.03804 |
L2-13B-q4 | 10000 | 47661 | Gradient | 4096 | 12288 | 6.0041 +/- 0.03785 |
L2-13B-q4 | 10000 | 32768 | Kobo | 4096 | 8192 | 6.0434 +/- 0.03913 |
L2-13B-q4 | 10000 | 26784 | Gradient | 4096 | 8192 | 5.9039 +/- 0.03831 |
For Llama3, the Gradient formula is definitely a better fit. For Llama2, it's a better fit for doubling and tripling context, but worse for quadrupling (however, 4x context on Llama2 is 12 GB just for the KV cache, so I doubt most people will use it).
Hope that helps.
EDIT: I'm an idiot and used my manual tuning result on Llama3 and didn't actually include the formula result. The table has been updated now.
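For reference, the "Gradient" values in the table appear to come from raising the original rope base to log(CTX_test / 2π) / log(CTX_train / 2π), per the gradient.ai post linked further down. A minimal sketch (variable names are mine, not from the PR) that reproduces that column to within rounding:

```python
import math

def gradient_rope_base(base, ctx_train, ctx_test):
    # Raise the original rope base to log(ctx_test/2pi) / log(ctx_train/2pi).
    two_pi = 2 * math.pi
    return base ** (math.log(ctx_test / two_pi) / math.log(ctx_train / two_pi))

print(gradient_rope_base(500000, 8192, 16384))  # ~1776948 (L3-15B row)
print(gradient_rope_base(10000, 4096, 8192))    # ~26784  (L2-13B, 2x context)
print(gradient_rope_base(10000, 4096, 16384))   # ~71738  (L2-13B, 4x context)
```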
Changing to draft. Discovered some scaling issues with Solar models (Mistral 7B v0.1 with sliding window attention).
Discovered that SWA models require the CTX figures to be 8x to get close to a suitable rope base.
OK, I managed to get a workable solution for Solar-based models (like fimb). I had to use the total tensor count of 435 together with the base of 10000 to identify them.
I haven't figured out a decent way to identify original Mistral 7B v0.1 models, but in theory they should use the previous logic, as the model 'thinks' it has a context of 32k. If GGUF had metadata for "sliding window" then it would be easy.
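A rough sketch of the Solar/SWA handling described above (my reading of these comments, not the PR's actual code): detect Solar-style models with the 435-tensor / base-10000 heuristic and inflate both context figures 8x before applying the rope formula.

```python
def ctx_figures_for_rope(ctx_train, ctx_test, total_tensor_count, rope_base):
    # Heuristic from the comment above: Solar-based SWA models show 435 total
    # tensors and report a rope base of 10000 in their GGUF metadata.
    is_solar_swa = (total_tensor_count == 435 and rope_base == 10000)
    # SWA models need the CTX figures multiplied by 8 to land on a suitable base.
    multiplier = 8 if is_solar_swa else 1
    return ctx_train * multiplier, ctx_test * multiplier
```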
Is this PR ready for review yet, or still in development?
https://gradient.ai/blog/scaling-rotational-embeddings-for-long-context-language-models has a formula that better fits the ideal rope scaling.
Tested with Llama3, and checked that the calculation is correct for Llama2. Retains the logic for not scaling rope if under the trained CTX.
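Putting the pieces from this thread together, a hedged sketch of what the overall calculation could look like (illustrative only, not the PR's actual code; the 8x SWA multiplier and the 435-tensor heuristic are taken from the comments above):

```python
import math

def auto_rope_freq_base(base, ctx_train, ctx_desired, total_tensor_count):
    # Keep the existing behaviour: no scaling at or below the trained CTX.
    if ctx_desired <= ctx_train:
        return base

    # Solar / sliding-window models: inflate the CTX figures 8x
    # (identified heuristically by 435 tensors with a base of 10000).
    if total_tensor_count == 435 and base == 10000:
        ctx_train *= 8
        ctx_desired *= 8

    # gradient.ai scaling formula.
    two_pi = 2 * math.pi
    return base ** (math.log(ctx_desired / two_pi) / math.log(ctx_train / two_pi))
```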