cmp-nct / ggllm.cpp

Falcon LLM ggml framework with CPU and GPU support

Upcoming PR - Pushing the Context limit to 8k+ for all existing Falcon models - Longrange Falcon flights #62

Open cmp-nct opened 1 year ago

cmp-nct commented 1 year ago

I plan to PR today, though it depends on final progress. Computation speed is slow because we currently have no mulmat kernel with interleaving broadcast support, so tests are time consuming. Falcon has twice the vocabulary of llama; in practice that means Falcon naturally has a performance benefit of 30-40% on English text and about 20-25% on code and foreign languages. This also means that 50 tokens/sec of Falcon speed is about as fast as 70 tokens/sec on llama in terms of language throughput, and an 8k context window on Falcon is equivalent to ~12k context on llama.
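To make that equivalence concrete, here is a small back-of-the-envelope sketch; the tokens-per-text ratio R is the rough estimate above (about 1.4-1.5), not a measured constant:

```cpp
// Back-of-the-envelope: if llama needs roughly R times more tokens than Falcon
// to encode the same English text (R ~ 1.4-1.5 per the estimate above), then a
// Falcon context of C tokens covers about the same text as an R*C llama context,
// and T tok/s of Falcon generation matches roughly R*T tok/s of llama generation.
#include <cstdio>

int main() {
    const double R          = 1.4;   // assumed tokens-per-text ratio (llama / Falcon)
    const double falcon_ctx = 8192;  // Falcon context window
    const double falcon_tps = 50;    // Falcon generation speed, tokens/sec
    std::printf("equivalent llama context: ~%.0f tokens\n", R * falcon_ctx); // ~11.5k (closer to 12k with R = 1.5)
    std::printf("equivalent llama speed:   ~%.0f tok/s\n", R * falcon_tps);  // ~70
    return 0;
}
```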

The task: pre-process a large input such as a book chapter, complex code, a tutorial or a transcription of a meeting. I then want to be able to interview Falcon about this huge text, to work with it, extend it or transform it.

For the current work I copied the entire falcon_eval_internal() function from the current libfalcon.cpp. That's 20 kB of source code and almost exactly 7k Falcon tokens, and the question asked is "<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"

I'm processing this on a high-quality quantization: the 40B Q5_K (OpenAssistant).

Default, unmodified Falcon result on the above question and the libfalcon.cpp input:

" when, as. and for are a to:, the by , use for a and on that: a,. for it and in, this you from is. for ,,. .' of.рен if( you they with,"

What is going on? If we look below the surface of how the model understands text, the most essential part for the relationship between tokens is the positional encoding done through "ROPE" (rotary position embedding). It sounds super complicated, but all it really is is a 2D rotation of each token's embedding based on its position in the total context. [image: visualization of this rotation for one embedding] This is how the model was trained to understand relationships between tokens and sequences within a 2048-token context. I am not entirely sure why this quite tight rotation is used; I assume (hope) someone mathed those parameters out.
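For readers who have not looked at the ROPE code: below is a minimal, simplified sketch of what this rotation does to one embedding vector. This is plain C++ for illustration, not the actual ggml kernel, and the function name rope_rotate is made up.

```cpp
// Simplified RoPE: each consecutive pair of embedding dimensions (x[i], x[i+1])
// is treated as a 2-D point and rotated by an angle that depends on the token's
// position and on the pair index. freq_base = 10000 is the base Falcon/llama were
// trained with, which produces the tight rotation described above.
#include <cmath>
#include <vector>

void rope_rotate(std::vector<float>& x, int pos, float freq_base = 10000.0f) {
    const int d = static_cast<int>(x.size());   // head dimension, assumed even
    for (int i = 0; i < d; i += 2) {
        const float theta = pos * std::pow(freq_base, -static_cast<float>(i) / d);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[i], x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;             // 2-D rotation of the pair
        x[i + 1] = x0 * s + x1 * c;
    }
}
```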

Beyond that 2048 context the model quite quickly stops computing proper attention; at 7k context it's completely braindead.

But by adapting the angle of rotation we can push it back into reality. For example, 8k context with a fixed scaled rotation angle: [image]
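In terms of the sketch above, a "fixed scaled rotation angle" amounts to something like linearly compressing the position before it enters the rotation. This is a hedged illustration of the idea; the exact scaling used in the PR may differ:

```cpp
// Fixed (linear) scaling: compress positions so that n_ctx_target positions span
// the same angular range the model saw for the n_ctx_train positions it was
// trained on. The result is fed into the theta computation in place of pos.
float scaled_pos(int pos, int n_ctx_train = 2048, int n_ctx_target = 8192) {
    const float scale = static_cast<float>(n_ctx_train) / n_ctx_target; // 0.25 for 8k
    return pos * scale;
}
```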

The model output now, for the same prompt "<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>":

Sure, here's a summary of the function:

  1. Initialize the context struct falcon with various parameters, such as the model, input tensor, output tensor, etc.
  2. Check if broadcasting is enabled and whether to use it. If so, set up g_repeat accordingly for the first 3 dimensions.
  3. Load input tensor from disk into a tensor using gml_tensor_from_file() or create an empty tensor with gml_zeros().
  4. Create output tensor for embedding using gml_tensor_new_d() and initialize it with zeros if necessary.
  5. Initialize the current layer to zero and set its input to the previous one, i.e., self.layer = -1.
  6. Loop over each attention head in a sequence of length n_head:
    • Load token tensor from disk into a tensor using gml_tensor_from_file() or create an empty tensor with gml_zeros().
    • Normalize the token tensor using gnorm() to get the embeddings, and store it as self.data.
    • Compute scores for every token in sequence using llama_forward() and store them in self.scores.
    • Repeat the above steps for each token until end of sequence is reached.
  7. Store the scores for all tokens in a matrix with gml_tensor_mul().
  8. Normalize the matrix by dividing it by the sum of squares of squares, add one to avoid division by zero.
  9. Softmax the matrix and store result as self.data. This is the token representation.
  10. If embedding is enabled, load embeddings from the model.<|endoftext|>

Here is another variant:

Sure, here's a summary of what this function does:

The function performs the self-attention operation for one step in the transformer model. It takes in the input embeddings from the previous layer (inpL), the current token mask, and the query vectors for each head (Q) and computes the attention weight matrix (K). The attention weight matrix is used to compute the weighted sum of the embeddings from the previous step, scaled by sqrt(n_embd/head), and then softmaxed. This result is then multiplied with the value vector V to produce the updated embeddings for the current token, which are stored in KV. The attention is applied over the entire input sequence (N) and the final output is added to the value for each head. If broad casting is enabled, the attention weight matrix is broadcasted across all heads. The function then takes in the memory from the previous layer (LL), rotates it by shifting the rows by (n_past-1)head_dims and adds it to the current token embeddings to create the final output LLK. This is stored in cur and returned as the result of the function. If LLK is empty, it returns a single value from K for the first token. The function then takes in the past tokens (pL), the attention mask, and the query vectors for each head (Q) and computes the attention weight matrix (KV). The attention weight matrix is used to compute the weighted sum of embeddings from the previous step, scaled by sqrt(n_embd/head), and then softmaxed. This result is added to the value for each token, which are stored in Q. The past tokens are also rotated by (n_past-1)head_dims and added to K for the first token. If broad casting is enabled, the weight matrix is broadcast across all heads. The function then takes in the attention output LLK from the previous layer, adds it to the result V, and adds the memory from LLK-1 to produce the final output LLK_merged. This is stored in cur and returned as the result of the function.

This is WIP. I currently have a bunch of different variants running that all perform a bit differently. The amount of hallucination is striking.

The benchmark is the best OpenAI currently has to offer; of course, those models not only have good parameters but were also fine-tuned for this purpose. Fine-tuning is something we can do once the Falcon large-context parameters are chosen.

Turbo-16k

This function evaluates the Falcon neural network model on a given sequence of tokens.

First, it initializes various variables and parameters related to the model and the context. Then, it creates a ggml_context and ggml_cgraph to perform the computation. Next, it processes each layer of the model using a loop. For each layer, it performs self-attention, applies a linear transformation, and computes the output. After all the layers have been processed, the function applies batch normalization and computes the embeddings. Then, it applies the language modeling head to the embeddings. Finally, it computes the logits and updates the kv token count and storage buffer. The function also performs optional offloading to the GPU and prints timing information for each ggml operation. It returns the logits and embeddings as output.

GPT4 at 8k:

The function falcon_eval_internal is a critical piece of the Falcon Natural Language Processing model. It processes tokens through the trained Falcon model for evaluation and returns the generated logits. The function begins by initializing various values and parameters based on the inputs and the model configurations. It then sets up a computational graph and loops through all the layers in the model. In each layer iteration, the function performs a series of mathematical operations, such as matrix multiplications, rearrangements, and normalizations, to perform self-attention on the provided tokens. For each layer, the function also manages memory buffers and uses API calls to calculate embeddings and language model heads. After the layer loop, normalization is performed on the input layer, and the language model head is computed. Before final logits can be returned, the function checks if all versus only last token logits are required and manages memory accordingly. The function concludes by measuring and tracking the time taken for execution.

Overall, Turbo as well as GPT4 provide a definitely better roundup, especially regarding hallucinations, though not super convincing in all cases either, which is also caused by the code being above the understanding of any LLM today.

cmp-nct commented 1 year ago

Update: will be delayed for another day.

Results with 40B are quite good, though calculating something like perplexity appears to be a difficult task due to the performance loss at high context. It's 6+ times faster than before the recent KV cache PR, but that's still too slow to use comfortably at 8k+ context. 7B, on the other hand, is fast and stays fast at 8k context, but I've not had the same success with it: it loses attention at less than 3k context, while 40B manages to stay focused beyond 8k.

cmp-nct commented 1 year ago

Well, in the end I found that my elaborate new method was more than beaten by the findings of two Reddit users called bloc97 and emozilla. After probably 12 hours of continued debugging, optimizing ROPE dynamically by compressing the rotation space in various ways and struggling with Falcon 7B just not coping well with any change, I stumbled on their findings. It's called "NTK-aware scaling". I've been living under a rock in the Falconverse; this has been known for almost 2 weeks.
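For reference, the NTK-aware variant keeps positions unscaled and instead stretches the frequency base, so the high-frequency pairs (local detail) are barely touched while the low-frequency pairs (long-range order) are spread over the larger context. A sketch of bloc97's base adjustment, reusing the simplified rope_rotate() from above:

```cpp
// NTK-aware scaling (bloc97): instead of scaling the position, scale the
// frequency base by alpha^(d/(d-2)), where d is the head dimension and alpha is
// roughly the desired context extension factor (e.g. alpha ~ 4 for about 4x the
// trained context). Pass the result to rope_rotate() in place of 10000.
#include <cmath>

float ntk_scaled_freq_base(float freq_base, float alpha, int d) {
    return freq_base * std::pow(alpha, static_cast<float>(d) / (d - 2));
}
```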

After implementing that quite closely, Falcon 7B gave me a brilliant response at 4k context (still > 30 tokens/sec generation) and a good 8k response as well (see below).

Falcon 7B at > 8k context:

  1. The code defines a function called compute_embeddings that takes in the input tensor, embedding, and returns an output tensor with the language modeling logits.
  2. It uses the ggml_cuda_assign_buffers function to set up the GPU buffer for the input tensor, embedding, and the output tensor, logits.
  3. It then uses the ggml_norm function to normalize the input tensor and the logits tensor.
  4. It computes the language modeling head using the lm_head tensor.
  5. It calculates the attention scores for each token in the input tensor using the ggml_mul function and the ggml_norm function.
  6. It uses the ggml_mul function to multiply the attention scores with the embedding tensor, and then applies a softmax activation function to produce the language modeling logits.
  7. It computes the attention scores for the last token using the ggml_mul function and the ggml_norm function.
  8. It calculates the language modeling logits for the last token using the ggml_softmax function.
  9. It uses the ggml_mul function to multiply the attention scores with the language modeling logits, and applies a softmax activation function to produce the language modeling logits.
  10. It computes the language modeling logits for each token using the ggml_softmax function.
  11. It calculates the language modeling logits for the last token using the ggml_mul function and applies a softmax activation function to produce the language modeling logits.
  12. It uses the ggml_add function to add the language modeling logits for each token, and the attention scores for the last token.
  13. It computes the language modeling logits for the last token using the ggml_add function and applies a softmax activation function to produce the language modeling logits.
  14. It updates the kv token count with the language modeling logits for each token.
  15. It extracts the embeddings from the output tensor, logits, and returns the tensor.
  16. It frees the GPU buffer using the ggml_free function.
  17. It calculates the performance metrics for the single-token evaluations.
  18. It returns true if the code executed successfully.
  19. If debug_timings is greater than 0, it prints timing information for each ggml operation.
  20. The function returns true at the end of execution.

Falcon 40B results are now also in: the quality is much better than before, and hallucinations are minimal now.

llama::falcon_compute() is a function that performs forward propagation in a transformer model, specifically the Falcon language model. It takes as input a context object ctx0, which contains information about the current layer being processed and the tensor buffers used for storing intermediate results. The function starts by initializing some variables, such as i_gpu_start, i_gpu_last, n_layer, and n_past. It then loops over each transformer layer in the model, starting from the current layer (il) to the last layer (model.layers.size()-1).

For each layer, it performs the following steps:

  1. Computes the attention weights between the input token embeddings and the previous tokens' embeddings using a rope trick.
  2. Applies a QKV rotation to the input token embeddings and the attention weights.
  3. Performs element-wise multiplication between the rotated input token embeddings and the attention weights, followed by softmax normalization to obtain the attention scores.
  4. Computes the key-value product using matrix multiplication and applies a scale factor to the result.
  5. Applies a linear transformation to the attention scores and the output of the previous layer (KV_prev or Vmem).
  6. Computes the dot product between the attention scores and the output of the previous layer, followed by softmax normalization to obtain the final output probabilities.
  7. If necessary, applies a mask to the output probabilities to hide some tokens from the decoder.
  8. Exports the output probabilities to disk for evaluation purposes (if lctx.export_perf is true).
  9. Updates the token count used by the language model's attention mechanism.
  10. Measures the time taken by this function and stores it in a variable t_p_eval_us.
  11. If necessary, copies the embeddings tensor to an output buffer for evaluation purposes.
  12. Returns true if all layers have been processed successfully, or false otherwise.

Note that this is just a summary of the key steps performed by the function, and there are additional details in the code comments that provide more information about each step. Additionally, the function uses some helper functions such as ggml_repeat(), ggml_permute(), ggml_diag_mask_inplace(), ggml_scale_inplace(), and ggml_copy().

https://github.com/cmp-nct/ggllm.cpp/pull/65 - will merge tomorrow if no issues come up.

vadi2 commented 1 year ago

What is the prompt processing performance like? I found that even at 3k context, the model was unfortunately not practical to use due to the tokenization speed on an RTX 4080.

cmp-nct commented 1 year ago

What is the prompt processing performance like? I found that even at 3k context, the model was unfortunately not practical to use due to the tokenization speed on an RTX 4080.

I'm working on that part. For a large prompt you need to use "-b" to process the prompt in batches. This is quite flawed currently; I'm working on an overhaul already. -b 64 to 512 works best (memory increases non-linearly with increasing context when using -b).
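For example, a hypothetical invocation along these lines; -b is the flag discussed above, while -m, -c and -f are assumed to follow the usual llama.cpp conventions and may differ in this repo:

```sh
# process a long prompt file in batches of 256 tokens (flag names other than -b are assumptions)
./falcon_main -m falcon-40b-q5_k.bin -c 8192 -b 256 -f long_prompt.txt
```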

When using 7B, 3k context is quite usable currently. Sadly the current release has the prompt cache broken; I'm fixing that too ;) The next release will take a week or two.

cmp-nct commented 11 months ago

What is the prompt processing performance like? I found that even at 3k context, the model was unfortunately not practical to use due to the tokenization speed on an RTX 4080.

The current branch ggfalcon_dev did make some progress in terms of processing speed, though the KV cache is not done on the GPU yet; that's the main limitation, and it grows with context.