cmp-nct opened 1 year ago
Update: this will be delayed for another day.
Results with 40B are quite good, though calculating something like perplexity remains difficult because of the performance loss at high context. It's 6+ times faster than before the recent KV cache PR, but that's still too slow to use comfortably at 8k+ context. 7B, on the other hand, is fast, and it still runs well at 8k context, but I've not had the same success with it quality-wise: it loses attention at less than 3k context, while 40B manages to stay focused beyond 8k.
Well, in the end I found that my elaborate new method was more than beaten by the findings of two Reddit users, bloc97 and emozilla. After probably 12 hours of continued debugging, optimizing RoPE dynamically by compressing the rotation space in various ways, and struggling with Falcon 7B just not coping well with any change, I stumbled on their findings. It's called "NTK aware scaling". I've been living under a rock in the Falconverse; this has been known for almost two weeks..
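For reference, the core of the "NTK aware scaling" idea is a one-line change: instead of compressing all positions linearly, the RoPE base frequency is stretched so the high-frequency dimensions keep their resolution while the low-frequency ones are slowed down. A minimal Python sketch of the math (the base 10000, head dimension 64, and scale factor 4 are illustrative RoPE defaults, not values taken from this code base):

```python
def rope_frequencies(head_dim, base=10000.0):
    # One inverse frequency per pair of embedding dimensions.
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def ntk_scaled_base(base, head_dim, scale):
    # NTK-aware scaling: stretch the base so the lowest frequency is
    # slowed down by exactly `scale`, while the highest (i = 0) is
    # left completely unchanged.
    return base * scale ** (head_dim / (head_dim - 2))

# Extending a 2048-token model to 8192 tokens -> scale factor 4.
plain = rope_frequencies(64)
scaled = rope_frequencies(64, ntk_scaled_base(10000.0, 64, 4.0))
```

Compare that with plain linear scaling, which divides every frequency by the same factor and thereby also blurs the fine-grained, high-frequency position information that short-range attention relies on.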
After implementing that quite closely, Falcon 7B gave me a brilliant response at 4k context (still > 30 tokens/sec generation) and a good 8k response as well (see below).

Falcon 7B at > 8k context:
- The code defines a function called `compute_embeddings` that takes in the input tensor, `embedding`, and returns an output tensor with the language modeling logits.
- It uses the `ggml_cuda_assign_buffers` function to set up the GPU buffer for the input tensor, `embedding`, and the output tensor, `logits`.
- It then uses the `ggml_norm` function to normalize the input tensor and the logits tensor.
- It computes the language modeling head using the `lm_head` tensor.
- It calculates the attention scores for each token in the input tensor using the `ggml_mul` function and the `ggml_norm` function.
- It uses the `ggml_mul` function to multiply the attention scores with the embedding tensor, and then applies a softmax activation function to produce the language modeling logits.
- It computes the attention scores for the last token using the `ggml_mul` function and the `ggml_norm` function.
- It calculates the language modeling logits for the last token using the `ggml_softmax` function.
- It uses the `ggml_mul` function to multiply the attention scores with the language modeling logits, and applies a softmax activation function to produce the language modeling logits.
- It computes the language modeling logits for each token using the `ggml_softmax` function.
- It calculates the language modeling logits for the last token using the `ggml_mul` function and applies a softmax activation function to produce the language modeling logits.
- It uses the `ggml_add` function to add the language modeling logits for each token, and the attention scores for the last token.
- It computes the language modeling logits for the last token using the `ggml_add` function and applies a softmax activation function to produce the language modeling logits.
- It updates the kv token count with the language modeling logits for each token.
- It extracts the embeddings from the output tensor, `logits`, and returns the tensor.
- It frees the GPU buffer using the `ggml_free` function.
- It calculates the performance metrics for the single-token evaluations.
- It returns true if the code executed successfully.
- If `debug_timings` is greater than 0, it prints timing information for each ggml operation.
- The function returns true at the end of execution.
Falcon 40B results are now also in: the quality is much better than before, and hallucinations are minimal now.
`llama::falcon_compute()` is a function that performs forward propagation in a transformer model, specifically the Falcon language model. It takes as input a context object `ctx0`, which contains information about the current layer being processed and the tensor buffers used for storing intermediate results. The function starts by initializing some variables, such as `i_gpu_start`, `i_gpu_last`, `n_layer`, and `n_past`. It then loops over each transformer layer in the model, starting from the current layer (`il`) to the last layer (`model.layers.size()-1`).

For each layer, it performs the following steps:

- Computes the attention weights between the input token embeddings and the previous tokens' embeddings using a rope trick.
- Applies a QKV rotation to the input token embeddings and the attention weights.
- Performs element-wise multiplication between the rotated input token embeddings and the attention weights, followed by softmax normalization to obtain the attention scores.
- Computes the key-value product using matrix multiplication and applies a scale factor to the result.
- Applies a linear transformation to the attention scores and the output of the previous layer (`KV_prev` or `Vmem`).
- Computes the dot product between the attention scores and the output of the previous layer, followed by softmax normalization to obtain the final output probabilities.
- If necessary, applies a mask to the output probabilities to hide some tokens from the decoder.
- Exports the output probabilities to disk for evaluation purposes (if `lctx.export_perf` is true).
- Updates the token count used by the language model's attention mechanism.
- Measures the time taken by this function and stores it in a variable `t_p_eval_us`.
- If necessary, copies the embeddings tensor to an output buffer for evaluation purposes.
- Returns true if all layers have been processed successfully, or false otherwise.

Note that this is just a summary of the key steps performed by the function, and there are additional details in the code comments that provide more information about each step. Additionally, the function uses some helper functions such as `ggml_repeat()`, `ggml_permute()`, `ggml_diag_mask_inplace()`, `ggml_scale_inplace()`, and `ggml_copy()`.
https://github.com/cmp-nct/ggllm.cpp/pull/65 Merge tomorrow if no issues come up
What is the prompt processing performance like? I found that even at 3k context, the model was unfortunately not practical to use due to the tokenization speed on an RTX 4080.
I'm working on that part; for a large prompt you need to use "-b" to process the prompt in batches. This is quite flawed currently, and I'm already working on an overhaul. -b 64 to 512 works best (memory increases non-linearly with increasing context when using -b).
When using 7B, 3k context is quite usable currently. Sadly the current release has the prompt cache broken; I'm fixing that too ;) The next release will take a week or two.
> What is the prompt processing performance like? I found that even at 3k context, the model was unfortunately not practical to use due to the tokenization speed on an RTX 4080.
The current branch ggfalcon_dev did make some progress in terms of processing speed, though the KV cache is not done on GPU yet; that's the main limitation, and it grows with context.
I plan to PR today, though it depends on final progress. The computation speed is slow because we currently have no mul-mat kernel with interleaving broadcast support yet, so tests are time consuming. Falcon has twice the vocabulary of llama; in practice that means Falcon naturally has a throughput benefit of 30-40% on English text and about 20-25% on code and foreign languages. This also means that 50 tokens/sec Falcon speed is about as fast as 70 tokens/sec on llama in terms of language throughput, and an 8k context window on Falcon is equivalent to ~12k context on llama.
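The throughput equivalence above is simple arithmetic and easy to sanity-check; here the 1.4 factor stands in for the ~40% English-text advantage quoted above (an illustrative number, not a measured constant):

```python
TEXT_PER_TOKEN_RATIO = 1.4  # assumed: ~40% more English text per Falcon token

def equivalent_llama_rate(falcon_tokens_per_sec):
    # Same text throughput, expressed in llama tokens/sec.
    return falcon_tokens_per_sec * TEXT_PER_TOKEN_RATIO

def equivalent_llama_context(falcon_context_tokens):
    # Same amount of text, expressed as a llama context size.
    return int(falcon_context_tokens * TEXT_PER_TOKEN_RATIO)

# 50 Falcon tokens/sec carries about as much text as 70 llama tokens/sec,
# and an 8k Falcon window holds roughly a 12k llama window's worth of text.
```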
The task: pre-processing a large input such as a book chapter, complex code, a tutorial, or a transcription of a meeting. I then want to be able to interview Falcon about this huge text, to work with it, extend it, or transform it.
For the current work I copied the entire falcon_eval_internal() function from the current libfalcon.cpp; that's 20 kB of source code and almost exactly 7k Falcon tokens. The question asked is:
"<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"
I'm processing this on a high-quality quantization: the 40B Q5_K (OpenAssistant).
Default Normal Falcon result on the above question and libfalcon.cpp input:
What is going on? If we look below the surface of how the model understands text, the most essential part of the relationship between tokens is the positional encoding done through "RoPE". It sounds super complicated, but all it actually is is a 2D rotation of each token's embedding based on its position in the total context. Visualized, this is the rotation of one embedding: This is how the model was trained to understand relationships between tokens and sequences within a 2048-token context. I am not entirely sure why this quite tight rotation is used; I assume (hope) someone mathed those parameters out.
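That rotation can be written down in a few lines: each consecutive pair of embedding dimensions is rotated by an angle proportional to the token's position, with a different frequency per pair. A minimal sketch of standard RoPE (base 10000 and head dimension 64 are the usual defaults, assumed here rather than taken from libfalcon.cpp):

```python
import math

def rope_rotate(x, y, position, pair_index, head_dim, base=10000.0):
    # Rotate one (x, y) pair of embedding dimensions by an angle that
    # grows with the token's position in the context.
    theta = position * base ** (-2.0 * pair_index / head_dim)
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Position 0 leaves the pair untouched; later positions only rotate it,
# so the embedding's magnitude is always preserved.
```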
Beyond that 2048 context, the model quite quickly stops calculating proper attention; at 7k context it's completely braindead.
But by adapting the angle of rotation we can push it back into reality. For example, 8k context with a fixed scaled rotation angle:
The model output now:
"<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"
Here is another variant:
This is WIP. I currently have a bunch of different variants running that all perform a bit differently. The amount of hallucination is striking.
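For contrast, the "fixed scaled rotation angle" approach amounts to dividing every position by a constant factor so that, say, 8192 positions map back into the 2048-position range the model was trained on. A sketch under those assumed numbers (scale 4 and base 10000 are again illustrative defaults, not values from this code base):

```python
def scaled_angle(position, pair_index, head_dim, scale=4.0, base=10000.0):
    # Linear scaling: compress all positions by `scale`, so token 8000
    # gets the rotation angle that token 2000 had during training.
    return (position / scale) * base ** (-2.0 * pair_index / head_dim)

# Unlike NTK-aware scaling, this slows down *all* frequencies equally,
# including the high-frequency ones encoding nearby-token relationships.
assert scaled_angle(8000, 0, 64) == scaled_angle(2000, 0, 64, scale=1.0)
```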
The benchmark is the best OpenAI currently has to offer; of course, they not only have good parameters but were also fine-tuned for this purpose. Fine-tuning is something we can do once the Falcon large-context parameters are chosen.
Turbo-16k
GPT4 at 8k:
Overall, Turbo as well as GPT-4 provide a definitely better roundup, especially regarding hallucinations, though not super convincing in all cases either, which is also caused by the code being above the understanding of any LLM today.