cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

16k+ context upgrade - Long-range Falcon #65

Closed. cmp-nct closed this issue 1 year ago.

cmp-nct commented 1 year ago

Default context is now 2048. The embedding rotation has been adapted to react to the context size and the expected generation length. Uses "NTK" Fourier-aware scaling of the rotation space.
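For illustration, here is a minimal sketch of the idea behind NTK-aware scaling of the rotary embedding. The function and variable names are made up for this example and the exact formula may differ from what the PR implements:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sketch of NTK-aware RoPE scaling: instead of compressing positions into the
// trained range, the rotary base is enlarged so the low-frequency components
// stretch to cover the longer context. head_dim is assumed to be even.
std::vector<float> rope_frequencies(int head_dim, float base, int n_ctx, int n_ctx_train) {
    // how far we exceed the trained context, e.g. 8192 / 2048 = 4
    const float alpha = std::max(1.0f, (float) n_ctx / (float) n_ctx_train);
    // NTK-aware base adjustment: base' = base * alpha^(d / (d - 2))
    const float scaled_base = base * std::pow(alpha, (float) head_dim / (head_dim - 2));
    std::vector<float> freqs(head_dim / 2);
    for (int i = 0; i < head_dim / 2; ++i) {
        // same formula as vanilla RoPE, theta_i = base^(-2i/d), just with the scaled base
        freqs[i] = std::pow(scaled_base, -2.0f * i / head_dim);
    }
    return freqs;
}
```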

7B and 40B have been tested to work well up to a context of 8k. Tests at >8k are incoming once performance at those sizes improves.

RAM requirements for the K/V cache:
Falcon 7B at 8k context: ~2 GB RAM
Falcon 40B at 8k context: ~5.5 GB RAM

In addition, falcon_eval() now uses a configuration struct instead of passing many parameters through multiple abstraction layers. This makes it much easier to pass new features from main into libfalcon.
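Roughly, the idea looks like this (the field names and the falcon_eval signature below are made up for illustration; the actual struct in libfalcon differs):

```cpp
// Illustrative only: bundling evaluation options into one struct, so that adding a
// new option means adding a field and reading it in libfalcon, instead of changing
// the signature of every layer between main and the evaluation code.
struct falcon_eval_config_example {
    int  n_ctx        = 2048;  // context window for this evaluation
    int  n_threads    = 4;     // CPU threads to use
    int  n_batch      = 512;   // prompt tokens processed per batch
    bool use_ntk_rope = true;  // enable NTK-aware rotary scaling
};

// A caller fills in only what it cares about, e.g.:
//   falcon_eval_config_example cfg;
//   cfg.n_ctx = 8192;
//   falcon_eval(ctx, tokens, n_tokens, n_past, cfg);   // hypothetical signature
```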

maddes8cht commented 1 year ago

With the Llama models we have these SuperHOT models that were finetuned on larger context sizes. While the RoPE build works with the "normal" Llama models, these special finetunes still bring some extra quality to long contexts. Would this be the same for Falcon? Will someone build such finetuned Falcons for long context?

cmp-nct commented 1 year ago


In my tests Falcon appeared sane and stable with the latest version; even 7B had no issues understanding 20-30 kB of source code. So right now it just works out of the box at 8k context, which is similar to a 12k context on Llama in terms of words.

Fine-tuning is likely helpful but does not appear to be necessary. I plan to get into that, but we have more pressing issues for now. The priority right now is performance; large-context performance especially needs improvement. A good web API is also needed. Without those performance improvements it's really hard to test the quality differences.

maddes8cht commented 1 year ago

I've been playing around with the context sizes and I'm really impressed. Above 8k it gets really slow, but it actually works, and for some purposes I won't even care much about the speed. One question: we initially had 512 as the default, then 2048, and now we have the option of 8192 or even 16384 tokens. Of course you can also set completely different ctx values, but these are the values that are mostly used. Is there a reason for them, like some kind of performance advantage with real powers of 2? Or are these just the numbers coders usually pick? ;)

cmp-nct commented 1 year ago


Great to hear, you probably have more experience with long-context Falcon than me. I'm dug deep into the code :) I'm working on a complete overhaul with a significant improvement in speed, if it works out the way I want.

Regarding context sizes being powers of 2 (binary base instead of decimal): those numbers often align well. For example, to operate in memory efficiently you need to be aligned to a specific power of 2. Sometimes memory buffers must be aligned to 2^X bytes to work at all (for a multiplication where the input variables span multiple bytes, for example). When multiplying tensors, some methods might not work if the shape (number of elements) is odd in a dimension; cuBLAS had that problem with one tensor.
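As a small illustration of the alignment point (the 64-byte alignment below is just an example, not ggml's actual requirement):

```cpp
#include <cassert>
#include <cstddef>

// Round a buffer size up to the next multiple of a power-of-two alignment.
// Power-of-two sizes tend to need no padding, which is why they "just fit".
static size_t align_up(size_t size, size_t alignment) {
    assert((alignment & (alignment - 1)) == 0);  // alignment must be a power of two
    return (size + alignment - 1) & ~(alignment - 1);
}

// Example with a 64-byte (cache-line) alignment:
//   align_up(4096, 64) == 4096   (2048 fp16 values, already aligned)
//   align_up(4000, 64) == 4032   (2000 fp16 values, needs 32 bytes of padding)
```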

In terms of context: Falcon was trained using a 2048 context. That should be the context where it works best; it's how it learned to process information. Llama was also trained on 2048, but the 512 ctx default in ggml was most likely chosen because it's so memory heavy. Falcon is much lighter on memory, so I decided to lift the default to where it should have been from the beginning.

You can use round decimal numbers instead of powers of two; it should not make a difference, so 2000 or 4000 will work fine. In that 'context' I believe 512 was mostly chosen by coders out of pragmatism (a clean quarter of 2048). It could have been 500 too.

EliEron commented 1 year ago

Just curious, is the NTK scaling code in this PR based on the original idea posted by bloc97 on Reddit, or the newer NTK-By-Parts method that bloc97 released around a week ago?

cmp-nct commented 1 year ago


It's the adaptive NTK scaling, not the interpolated one from that example. It's not a big thing to add that one if it's better.
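To make the distinction concrete, a rough sketch of the two approaches being contrasted (illustrative only, not the code in this PR):

```cpp
#include <cmath>

// Linear ("interpolated") scaling: keep the RoPE base and compress positions so
// they stay inside the trained range; usually paired with a finetune on the
// compressed positions.
float interpolated_position(int pos, int n_ctx, int n_ctx_train) {
    const float scale = (float) n_ctx / (float) n_ctx_train;  // e.g. 8192 / 2048 = 4
    return (float) pos / scale;
}

// NTK-style scaling: keep positions as-is and enlarge the rotary base instead,
// which tends to work without finetuning.
float ntk_scaled_base(float base, int head_dim, int n_ctx, int n_ctx_train) {
    const float alpha = (float) n_ctx / (float) n_ctx_train;
    return base * std::pow(alpha, (float) head_dim / (head_dim - 2));
}
```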

EliEron commented 1 year ago


Okay, that's good to know. I haven't played around with the newer method myself, but according to this comment from jquesnelle (who works with bloc97) it is seemingly an across-the-board improvement over the original method.

The version I linked to is the non-adaptive one, which from my understanding is better for finetuning but not for inference with the original models. There is an improved adaptive version in the repo as well. I'm not entirely clear on whether that is the version you actually implemented here, but if it isn't, you might want to look at the Falcon version they have available.

I hope I'm not pointing out anything obvious; I just wanted to put these on your radar in case you hadn't come across them yet. It's hard to keep track of everything going on in the LLM space these days 😄.