ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Improve `cvector-generator` #8724

Open ngxson opened 1 month ago

ngxson commented 1 month ago

Ref: https://github.com/ggerganov/llama.cpp/pull/7514

Note: I'm writing this for tracking, but people are free to open PRs if they have any ideas to share. I'm planning to do some of the tasks in August 2024.

mhnghfv commented 1 month ago

I've looked into the code and related material and it seems like this PCA idea is broken. A normal PCA implementation would:

  1. Center the data by subtracting the mean
  2. Compute a matrix of covariances between the dimensions of the data (the covariance matrix)
  3. Compute the eigenvectors of the covariance matrix.

The hardest part is the last step, so in #6880 , @jukofyork suggested a simple algorithm that can calculate the largest eigenvector (which is easy to do efficiently compared to trying to compute all eigenvectors at once), which gives the 1st principal component. You, however, seem to have misinterpreted that as being the whole of the PCA algorithm. This resulted in your confusion about the embedding matrix not being square in #7514 - covariance matrices are always square, so this is not a problem for PCA, but you skipped the first two steps of the algorithm and tried to feed the input directly into step 3. @christianazinn saw that this is how it was done in SciPy but said "We may care to implement this after everything else works" and instead suggested substituting it with multiplying the matrix by its transpose as a temporary fix, but since this is still the wrong matrix (not the covariance matrix), the algorithm will not work correctly until the real covariance matrix is implemented.
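
For reference, a minimal sketch of those three steps in plain C++ (illustrative only, not the cvector-generator code; rows are samples, columns are dimensions, and power iteration stands in for a full eigendecomposition since only the top component is needed):

#include <cmath>
#include <vector>

// Top principal component of `data` (rows = samples, cols = dimensions)
// via the textbook pipeline: center, covariance, power iteration.
std::vector<float> pca_top_component(std::vector<std::vector<float>> data, int n_iter = 100) {
    const size_t m = data.size(), n = data[0].size();

    // 1. Center the data by subtracting the per-dimension mean.
    std::vector<float> mean(n, 0.0f);
    for (const auto & row : data)
        for (size_t j = 0; j < n; ++j) mean[j] += row[j] / m;
    for (auto & row : data)
        for (size_t j = 0; j < n; ++j) row[j] -= mean[j];

    // 2. Covariance matrix C = X^T X / (m - 1) -- always n x n, hence square.
    std::vector<std::vector<float>> cov(n, std::vector<float>(n, 0.0f));
    for (const auto & row : data)
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j) cov[i][j] += row[i] * row[j] / (m - 1);

    // 3. Power iteration: v <- C v / ||C v|| converges to the top eigenvector.
    std::vector<float> v(n, 1.0f / std::sqrt((float) n));
    for (int it = 0; it < n_iter; ++it) {
        std::vector<float> w(n, 0.0f);
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j) w[i] += cov[i][j] * v[j];
        float norm = 0.0f;
        for (float x : w) norm += x * x;
        norm = std::sqrt(norm);
        for (size_t i = 0; i < n; ++i) v[i] = w[i] / norm;
    }
    return v;
}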

However, even if PCA were implemented correctly, it might still not work. This is because you're not supposed to take the PCA of the difference between the positive and negative vectors. If we look at the paper that #5970 referenced, we see that in "Constructing the PCA Model" they specify "we do not use any labels by default unless explicitly stated otherwise, making the procedure unsupervised" - this is an unsupervised method, meaning it takes random unlabeled data (not positive and negative examples) and finds features on its own. Another quote from the paper - "PCA - We take an unlabelled dataset D that primarily varies in the concept of interest. We take the top PCA direction that explains the maximum variance in the data."

Some of the wording in the papers seems to imply that running PCA on the differences would make sense, but if you think about how PCA is supposed to work, you would expect that to fail, since the first step centers the data, which would subtract the average difference and obliterate the main direction of difference between the positive and negative examples, leaving PCA to find some other unrelated feature left behind after that. Random pairing would not have this issue since random differences would not average to anything meaningful and therefore keep the negative-positive feature intact. Randomizing the order/direction of the pairing (e.g. positive-negative vs negative-positive) would also alleviate this issue, which I expect might be the reason why the paper reports good performance of PCA on paired differences, though there might also be some other reason that I'm missing. But even in the paper, PCA is used mainly as an unsupervised method, while mean difference is used as a supervised method.

The blog post referenced in #5970 seems to be confused by this too and suggests taking the PCA of differences. If you look at the GitHub issues for the control vector library used in the blog, vgel/repeng, people report it not working for some models other than the one the author used (issue 33). But it does appear to work in the blog post and some other related discussion, so maybe there's something about PCA that counterintuitively works here despite what would be expected.

Your implementation also seems like it could accidentally work in some situations, since you are multiplying a matrix of differences by its transpose, which is somewhat vaguely similar to how correlation is calculated, though differences to the mean are used rather than paired differences. But for examples that vary significantly in some direction from the mean, properly paired differences would be roughly in the same direction as the difference from the mean, so your matrix would accidentally compute something PCA-like, which could work, sometimes, and only when the direction of the largest difference is aligned with the desired negative-positive difference. Indeed, your examples didn't work as expected for me when I tested them: the control vector produced sad/dramatic outputs for both positive and negative scale values. The model I was using was not Llama, however, so maybe it really does work on Llama for some reason.

TL;DR PCA is not supposed to be used for paired negative-positive examples, and the implementation is broken anyway. My personal opinion is that it should either be removed or the original unsupervised control vectors should be implemented with a correct PCA implementation. The mean-difference method should be correct afaik and can be kept as is.

jukofyork commented 1 month ago

Finding the eigenvectors of a data matrix multiplied by its transpose is a valid way of computing PCA so long as you mean-center the data matrix first (if you use SVD on the data matrix and then take the right singular vectors then you don't need to mean-center, though).
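
In symbols (the standard identity, with rows as samples): if the mean-centered $m \times n$ data matrix $X_c$ has the SVD $X_c = U \Sigma V^\top$, then

$$\frac{1}{m-1} X_c^\top X_c = V \, \frac{\Sigma^2}{m-1} \, V^\top,$$

so the eigenvectors of the covariance matrix are exactly the right singular vectors $V$, with eigenvalues $\sigma_i^2 / (m-1)$.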

I think the main problem with the current code is it's missing the final step where you project the data onto the vector to decide if the sign of the vector needs flipping - without this step the control vectors are very broken.
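
A minimal sketch of that missing sign check (a hypothetical helper, not the existing code; `diffs` holds the positive-minus-negative rows and `v` is the computed eigenvector):

#include <vector>

// Flip the eigenvector so it points from negative towards positive: project
// each (positive - negative) difference onto v and check the sign of the
// summed projection.
void fix_sign(std::vector<float> & v, const std::vector<std::vector<float>> & diffs) {
    float total = 0.0f;
    for (const auto & d : diffs)
        for (size_t j = 0; j < v.size(); ++j) total += d[j] * v[j];
    if (total < 0.0f)
        for (float & x : v) x = -x;
}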

@ngxson said in a different thread he was busy with his job but was going to look at this soon.

If he doesn't then I will have a look at fixing it - I'm just not that familiar with the ggml stuff and it looks like quite a time investment to be sure of not leaking memory and/or doing other stupid stuff :/

Computing the first PCA component only is actually really easy and doesn't require calculating the covariance matrix at all: use gradient descent to train an autoencoder with a single scalar latent, using least-squares loss (aka Oja's rule). You can extend this idea to get the first few PCA components too (by repeatedly subtracting the component and then rerunning), but as the eigenvalues get smaller and smaller it starts to suffer from numerical problems.
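
A minimal sketch of that idea (Oja's rule on already mean-centered samples; the learning rate and epoch count here are arbitrary):

#include <cmath>
#include <vector>

// Oja's rule: a one-unit linear autoencoder trained with least squares.
// w converges to the top principal component without ever forming the
// covariance matrix; samples are assumed to be mean-centered already.
std::vector<float> oja_top_component(const std::vector<std::vector<float>> & samples,
                                     int n_epochs = 100, float lr = 0.01f) {
    const size_t n = samples[0].size();
    std::vector<float> w(n, 1.0f / std::sqrt((float) n));
    for (int epoch = 0; epoch < n_epochs; ++epoch) {
        for (const auto & x : samples) {
            float y = 0.0f; // y = w . x (the scalar "encoding")
            for (size_t j = 0; j < n; ++j) y += w[j] * x[j];
            // w += lr * y * (x - y w): least-squares step that keeps ||w|| ~= 1
            for (size_t j = 0; j < n; ++j) w[j] += lr * y * (x[j] - y * w[j]);
        }
    }
    return w;
}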

jukofyork commented 1 month ago

I've actually found what I think is a better way to do this though, via the cross-covariance matrix:

https://github.com/jukofyork/control-vectors

The basic idea is to generate 3 samples for each prompt: baseline, negative and positive. Then subtract the positive and the negative from the baseline (instead of mean centering). Then finally find the eigenvectors of the cross-covariance matrix of these two difference matrices (and use either the discriminant ratio or variance reduction to find the eigenvector you want, plus some other stuff - see the code here).

The reason this seems to work better is that it specifically looks for directions that are "balanced" in terms of their vector norms from the baseline. The cross-covariance matrix can only have a large value if both are large (and is maximised when both are equidistant from the baseline), whereas the covariance matrix can have a large value if one is almost the same as the baseline.
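
A rough sketch of how I'd summarise that (my paraphrase, not the code from the linked repo; symmetrizing the result is an assumption here, so that standard symmetric eigenvector routines can be applied afterwards):

#include <vector>

// Cross-covariance matrix C = D_pos^T D_neg / m, where row k of D_pos is
// (positive_k - baseline_k) and row k of D_neg is (negative_k - baseline_k).
// Entry C[i][j] can only be large if both differences are large.
std::vector<std::vector<float>> cross_covariance(
        const std::vector<std::vector<float>> & d_pos,
        const std::vector<std::vector<float>> & d_neg) {
    const size_t m = d_pos.size(), n = d_pos[0].size();
    std::vector<std::vector<float>> c(n, std::vector<float>(n, 0.0f));
    for (size_t k = 0; k < m; ++k)
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j)
                c[i][j] += d_pos[k][i] * d_neg[k][j] / m;
    // Symmetrize as (C + C^T) / 2 before any eigendecomposition.
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j) {
            const float s = 0.5f * (c[i][j] + c[j][i]);
            c[i][j] = c[j][i] = s;
        }
    return c;
}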

mhnghfv commented 1 month ago

Finding the eigenvectors of a data matrix multiplied by its transpose is a valid way of computing PCA so long as you mean-center the data matrix first (if you use SVD on the data matrix and then take the right singular vectors then you don't need to mean-center, though).

I mean, I did mention that in

which is somewhat vaguely similar to how correlation is calculated, though differences to the mean are used rather than paired differences

difference to mean being the same as centered data (instead of correlation I should have said covariance). But it's still the case that the original algorithm can (should?) be used on unlabeled data, unsupervised. Maybe we should somehow allow mixing labeled with unlabeled data for easy semi-supervised training.

mhnghfv commented 1 month ago

The basic idea is to generate 3 samples for each prompt: baseline, negative and positive. Then subtract the positive and the negative from the baseline (instead of mean centering). Then finally find the eigenvectors of the cross-covariance matrix of these two difference matrices

sounds like a good idea

jukofyork commented 1 month ago

difference to mean being the same as centered data (instead of correlation I should have said covariance). But it's still the case that the original algorithm can (should?) be used on unlabeled data, unsupervised

I've tried many different methods over the last month or so and I think what you mean is: concatenate both datasets and find the principal eigenvector? If so, then this doesn't always work, because often the principal (and even the 2nd) eigenvector will actually not be the direction that forms an axis between the two things you want, eg: in terms of creative writing, trying to use "positive" and "dark" examples often ends up with "story writing" as the first principal component. You can do some post-processing like in the code I linked above to find which direction is the correct one, but it is still inferior to the cross-covariance matrix method from my experiments.

jukofyork commented 1 month ago

There are a lot of my thoughts in this thread:

https://huggingface.co/jukofyork/creative-writing-control-vectors-BETA-v0.1/discussions/2

and the different iterations of things I've tried are here:

https://huggingface.co/collections/jukofyork/creative-writing-control-vectors-66859c95443833e90f8a7f84

The latest control vectors use soft-thresholding as a form of L1-regularisation (the scale factor in llama.cpp can be thought of as similar to L2-regularisation).
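
For anyone unfamiliar, soft-thresholding is the proximal operator of the L1 norm: applied element-wise it zeros out small components entirely, whereas a plain scale factor shrinks every component uniformly (a sketch, not the code from the linked repo):

#include <algorithm>
#include <cmath>

// Soft-thresholding: shrink |x| by lambda, clamping anything with
// |x| <= lambda to exactly zero (L1-style sparsification).
float soft_threshold(float x, float lambda) {
    return std::copysign(std::max(std::fabs(x) - lambda, 0.0f), x);
}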

The biggest limitation of the method is only being able to affect a single "writing trait" at a time - vector addition somewhat works but is very finicky, and blended training data ends up with much less discrimination between the classes...

The transformers code won't actually let me get the inputs to the down_proj matrix, but if you can get samples of both input and output then it may be possible to modify the down_proj matrix as a "cheap" form of fine tuning (ie: back prop only through the down_proj matrix using least squares, use contrastive or triplet loss, and so on).

It may turn out you still need the control vectors though (to act as a bias / affine transformation), as from experiments with Householder transformations on down_proj it seems very hard to make a model change its default behavior using linear transformations alone (ie: you can reflect and rotate a model's sense of "dark" and "positive" writing, but you can't actually get rid of "positivity bias" like this).

ngxson commented 1 month ago

@mhnghfv Hi, and thanks for having a look into the code. To give you a bit of context, the PCA implementation was initially proposed by @christianazinn . I'm not particularly good at math (my background is software engineering), so to be honest, it would have been impossible for me to do this without invaluable help from him and @jukofyork .

However, the implementation is not guaranteed to be exactly correct, so we currently depend on help from contributors. It would be very helpful if you could create a PR to showcase your approach.

A quick update on the current issue: I started some initial work in this branch, but it turns out to be far more complicated than I thought. It's true that simply subtracting negative - positive pairs is quite naive, but doing it another way will require a refactoring. As I'm currently busy with other parts of the project, I can't be confident that it can be finished very soon.

jukofyork commented 1 month ago

If you can get the code to the stage where the data matrices are just stored as plain C++ vectors or a straightforward chunk of C floats, then I can quite quickly get the rest of the code working just using plain C++ vectors (on the CPU) and then look at porting this to use the ggml GPU code if necessary.

I just can't spend the time needed to learn all the ins and outs of the ggml stuff atm, and I can't see an easy and straightforward way to pull apart the existing code as all the differencing logic is done online via ggml calls :/

If I could just get a couple of std::vector<std::vector<float>> to work from, then I could bypass all the ggml stuff and it would be super easy for me to do all the rest.

jukofyork commented 1 month ago

Even just some very barebones code to set up the callbacks, pulling the floats out of the ggml tensors and placing in a std::vector and correctly deallocating memory and/or closing any contexts so as not to leak memory or resources, and I would be good to go :)

jukofyork commented 1 month ago

A quick update on the current issue: I started some initial work in this branch, but it turns out to be far more complicated than I thought. It's true that simply subtracting negative - positive pairs is quite naive, but doing it another way will require a refactoring. As I'm currently busy with other parts of the project, I can't be confident that it can be finished very soon.

Yeah, I thought I might be able to fix the sign problem easily when ChuckMcSneed mentioned it in the other thread, but by having the matrices pre-differenced I couldn't see an easy way either :/

I think the pre-differenced matrix is the main blocker atm.

ngxson commented 1 month ago

Even just some very barebones code to set up the callbacks, pulling the floats out of the ggml tensors and placing in a std::vector and correctly deallocating memory and/or closing any contexts so as not to leak memory or resources, and I would be good to go :)

IIRC this is actually what @christianazinn did in the first version of the implementation. It works well for prototyping because it doesn't require playing with ggml, so it's also a good practice. Once everything works and you want to have proper GPU support, I can help to convert the implementation to ggml.

jukofyork commented 1 month ago

Even just some very barebones code to set up the callbacks, pulling the floats out of the ggml tensors and placing in a std::vector and correctly deallocating memory and/or closing any contexts so as not to leak memory or resources, and I would be good to go :)

IIRC this is actually what @christianazinn did in the first version of the implementation. It works well for prototyping because it doesn't require playing with ggml, so it's also a good practice. Once everything works and you want to have proper GPU support, I can help to convert the implementation to ggml.

Oh, I'll have a look tomorrow at the commit history then!

jukofyork commented 2 weeks ago

Just posting these here so I don't lose them:

Simple Algorithms for the Partial Singular Value Decomposition

The first "Power Method" algorithm looks fairly easy to adapt pca.hpp to be able to calculate truncated SVD. It should be a lot more efficient for rectangular matrices, eg: 2*O(n^2*m) vs O(n^3) where n << m (ie: hidden state dimension << number of samples). The eigenvectors are the same as the right singular values as explained here.

Sadly it won't allow for calculation of the Moore–Penrose pseudo-inverse via SVD, as the error is proportional to the reciprocal of the smallest singular values.


But, this paper outlines an algorithm that looks to be reasonably easy to implement:

An iterative method to compute Moore-Penrose inverse based on gradient maximal convergence rate


jukofyork commented 2 weeks ago

@ngxson @slaren

I want to make the control vectors allow for low-rank transformations as well as just offsets/directions first, but after that I am going to see if I can fix up the current cvector-generator code (and possibly implement the algorithms above, which might also be useful for https://github.com/ggerganov/llama.cpp/discussions/8831). I have a couple of questions about the current pca.hpp implementation:

What I don't understand is:

  1. Why the compute graph doesn't seem to be reused and build_graph_piter() seems to get called every iteration? From: ggml.h:

// The described approach allows to define the function graph once and then compute its forward or backward graphs
// multiple times. All computations will use the same memory buffer allocated in the ggml_init() function. This way
// the user can avoid the memory allocation overhead at runtime.

Is there a reason for calling build_graph_piter() over and over like this?

  2. What is the purpose of the for (int i = 0; i < params.n_batch; ++i) loop:
    for (int i = 0; i < params.n_batch; ++i) {
        // b_tensor = square * eigenvector^T
        b_tensor = ggml_mul_mat(ctx0, input_square, old_eigen);
        ggml_set_name(b_tensor, "b_tensor");

        // normalize
        b_tensor = ggml_div_inplace(ctx0,
            b_tensor,
            ggml_sqrt_inplace(ctx0, ggml_sum_rows(ctx0, ggml_sqr(ctx0, b_tensor)))
        );
        ggml_format_name(b_tensor, "b_tensor_norm_%d", i);

        // calculate distance(new eigenvector - old eigenvector)
        // we don't use ggml_sub because it may not be implemented on GPU backend
        struct ggml_tensor * new_sub_old = ggml_add(ctx0, old_eigen, ggml_scale(ctx0, b_tensor, -1));
        distance = ggml_sqrt_inplace(ctx0,
            ggml_sum_rows(ctx0, ggml_sqr_inplace(ctx0, new_sub_old)));
        ggml_format_name(distance, "distance_%d", i);

        old_eigen = b_tensor;

        // build operations nodes
        ggml_build_forward_expand(gf, distance);
    }

This looks more like an unrolling of the A(A(Ax))... matrix-vector multiply than any sort of batch/parallel processing, and I can't see why this would be advantageous?

Is this related to the above and some way of lessening the extra overhead of the repeated creation of the compute graph?

jukofyork commented 2 weeks ago

@ngxson @slaren

Also related to the "make the control vectors allow for low-rank transformations as well as just offsets/directions" changes:

What's the standard way of passing complex objects into the llama.cpp file's functions? The problem I have is that llama.cpp is C++ yet the llama.h header appears to be only C... So when I have a complex data structure like a nested std::vector<> or std::map<>, it's very frustrating to pass, as it has to be cast into a C-style data structure using raw pointers... :/

Passing as a void* and casting back isn't really a great solution (and is frowned upon in a lot of places), but the alternative of crafting vectors of float*, float** and int* with lots of len parameters is almost as ugly and brittle as just using a void* :(

My GGUF saving code (that is based off the repeng/extract.py source) also adds these two parameters to the control vector files it creates:

        writer.add_string(f"{arch}.model_hint", self.model_type)
        writer.add_uint32(f"{arch}.layer_count", len(self.directions))

where the layer_count key holds the number of layers in the model (rather than the number of tensors, which seems to be stored in the GGUF header anyway). I assumed this was going to be used as a sanity check inside the llama.cpp::llama_control_vector_load_one() code, but it seems to be unused.

Adding this sanity check in would likely make the C++ --> C conversion of the data structures much less ugly than the current llama_control_vector_load_one() and llama_control_vector_load() logic.

AFAIK, nobody else is making control vectors, and I think I can add in the low-rank transformations I plan to add without breaking any backwards compatibility, but adding this sanity check may break control vectors that were created without it?

Is this an OK breaking change to make?

ngxson commented 2 weeks ago
  1. Why the compute graph doesn't seem to be reused and build_graph_piter() seems to get called every iteration?

build_graph_piter() is re-called for each iteration because the underlying memory may be re-allocated (@slaren explained it once but I can't find the comment anywhere). In any case, it's normal behavior, and building the cgraph in this case does not have a very big performance impact.

In addition to that, we actually also build the cgraph to calculate in batches. This is controlled by params.n_batch. The cgraph for batch=1 looks like this, and for batch>1 it looks like this (the image is hidden under "Click to see graph image").

The advantage is that we can make the GPU do more in one go, but the disadvantage is higher memory usage.

ngxson commented 2 weeks ago

AFAIK, nobody else is making control vectors, and I think I can add in the low-rank transformations I plan to add without breaking any backwards compatibility, but adding this sanity check may break control vectors that were created without it?

Yes, you can make a PR for that. I agree that not many users are actively using control vectors atm, so breaking changes are OK.

The problem I have is that llama.cpp is C++ yet the llama.h header appears to be only C

The header file is plain C because many bindings (like llama-cpp-python) rely on FFI (and thus only support a C interface).

It's better not to have too complicated a struct. Instead, the "C way" is to break it into smaller structs and use multiple function calls to retrieve the data. I don't know how to explain it better, but that's the way to do it when working with low-level programming.
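
Something like this (hypothetical names, just to illustrate the opaque-handle-plus-accessors pattern):

// Opaque handle: the struct is only defined in the C++ implementation file.
struct my_cvector;

// Instead of passing a nested C++ container across the boundary, the caller
// pulls the data out piece by piece using plain C types.
struct my_cvector * my_cvector_load    (const char * path);
int32_t             my_cvector_n_layers(const struct my_cvector * cv);
int32_t             my_cvector_n_embd  (const struct my_cvector * cv);
const float *       my_cvector_layer   (const struct my_cvector * cv, int32_t il);
void                my_cvector_free    (struct my_cvector * cv);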

jukofyork commented 2 weeks ago
  1. Why the compute graph doesn't seem to be reused and build_graph_piter() seems to get called every iteration?

build_graph_piter() is re-called for each iteration because the underlying memory may be re-allocated (@slaren explained it once but I can't find the comment anywhere). In any case, it's normal behavior, and building the cgraph in this case does not have a very big performance impact.

In addition to that, we actually also build the cgraph to calculate in batches. This is controlled by params.n_batch. The cgraph for batch=1 looks like this, and for batch>1 it looks like this (the image is hidden under "Click to see graph image").

The advantage is that we can make the GPU do more in one go, but the disadvantage is higher memory usage.

Thanks - the graphs are really helpful! :)

Won't this just block on the matrix-vector product that creates the b_tensor though? The matrix-vector product is O(n^2), since a square n x n matrix gets multiplied by a vector of size n, while all the other operations are O(n). I can't see how this could help, as none of the O(n^2) part can be calculated before the input vector is computed?

ngxson commented 2 weeks ago

The speed gain is not because of computational complexity, but thanks to the time saved transferring data between the CPU <==> GPU. If only one iteration is executed by the GPU at a time, then the data will need to be copied CPU <==> GPU very frequently.

That's why, when working with a GPU, it's always better to tell the GPU to process in batches (this is true both when using GPUs for ML and for 3D graphics).

jukofyork commented 2 weeks ago

AFAIK, nobody else is making control vectors, and I think I can add in the low-rank transformations I plan to add without breaking any backwards compatibility, but adding this sanity check may break control vectors that were created without it?

Yes, you can make a PR for that. I agree that not many users are actively using control vectors atm, so breaking changes are OK.

The problem I have is that llama.cpp is C++ yet the llama.h header appears to be only C

The header file is plain C because many bindings (like llama-cpp-python) rely on FFI (and thus only support a C interface).

It's better not to have too complicated a struct. Instead, the "C way" is to break it into smaller structs and use multiple function calls to retrieve the data. I don't know how to explain it better, but that's the way to do it when working with low-level programming.

Yeah, it's this function that is going to get incredibly ugly:

    // Apply a loaded control vector to a llama_context, or if data is NULL, clear
    // the currently loaded vector.
    // n_embd should be the size of a single layer's control, and data should point
    // to an n_embd x n_layers buffer starting from layer 1.
    // il_start and il_end are the layer range the vector should apply to (both inclusive)
    // See llama_control_vector_load in common to load a control vector.
    LLAMA_API int32_t llama_control_vector_apply(
            struct llama_context * lctx,
                     const float * data,
                          size_t   len,
                         int32_t   n_embd,
                         int32_t   il_start,
                         int32_t   il_end);

It's actually already padding data with zero-valued control vectors, due to having to be in a contiguous block of memory bounded by len, but it's going to get even worse for what I want to do next, as unlike the control vectors, the low-rank transformation vector pairs cannot be summed and have to be concatenated into two n x k matrices A and B.

In C++ I would just pass a std::vector<std::vector<float>> instead of data and len here (thus avoiding the need for wasteful zero-padded control vectors), and either a std::vector<std::vector<float>> or a std::vector<std::vector<std::vector<float>>> for the low-rank transformations (n_embd can be used to deduce how many they hold in the former case).
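
To make the contrast concrete, here's roughly what the C++-to-C flattening looks like for the current signature (a sketch based on the doc comment above; `per_layer[il]` is assumed empty when layer `il` has no vector):

#include <vector>

// Flatten per-layer control vectors into the contiguous buffer that
// llama_control_vector_apply() expects: n_embd floats per layer starting
// from layer 1, with zero-valued rows standing in for missing layers.
std::vector<float> flatten_control_vectors(
        const std::vector<std::vector<float>> & per_layer, int n_embd) {
    std::vector<float> data;
    data.reserve(per_layer.size() * n_embd);
    for (const auto & layer : per_layer) {
        if (layer.empty()) {
            data.insert(data.end(), n_embd, 0.0f); // wasteful zero padding
        } else {
            data.insert(data.end(), layer.begin(), layer.end());
        }
    }
    return data; // pass data.data() as `data` and data.size() as `len`
}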

jukofyork commented 2 weeks ago

The speed gain is not because of computational complexity, but thanks to the time saved transferring data between the CPU <==> GPU. If only one iteration is executed by the GPU at a time, then the data will need to be copied CPU <==> GPU very frequently.

That's why, when working with a GPU, it's always better to tell the GPU to process in batches (this is true both when using GPUs for ML and for 3D graphics).

Yeah, it does seem an amazingly complex way of doing something so simple though! :)

Using Python or a C++ linear algebra library I could implement all that in 5-10 lines, and even using GSL or MKL it would only be 30-40 lines of C!

I think I may just go back to my idea of sampling the hidden states in llama.cpp, exporting the data file and computing the control vectors externally until I'm 100% settled on the final algorithm... Otherwise it looks really easy to end up having to do a complete rewrite each time it changes :/

ngxson commented 2 weeks ago

IMO you can use an "internal" struct for the control vector and expose functions to create/manipulate it.

For example:

// internal cpp struct, don't need to define it in header file
struct llama_control_vector;

struct llama_control_vector * llama_control_vector_init(const char * file_path);
int32_t llama_control_vector_apply(struct llama_context * ctx, struct llama_control_vector * cvec, int il, float scale);

// example usage
struct llama_control_vector * happy = llama_control_vector_init("...");
for (int il = 0; il < llama_n_layer(model); il++) {
  // only apply if il > 5 and < 10
  llama_control_vector_apply(ctx, happy, il, (il > 5 && il < 10) ? 1.0 : 0.0);
}

(Have a look at the lora functions, it's the same idea)

jukofyork commented 2 weeks ago

IMO you can use an "internal" struct for the control vector and expose functions to create/manipulate it.

For example:

// internal cpp struct, don't need to define it in header file
struct llama_control_vector;
struct llama_control_vector_apply_params;

struct llama_control_vector * llama_control_vector_init(const char * file_path);
int32_t llama_control_vector_apply(struct llama_control_vector * cvec, int il, float scale);

// example usage
struct llama_control_vector * happy = llama_control_vector_init("...");
for (int il = 0; il < llama_n_layer(model); il++) {
  // only apply if il > 5 and < 10
  llama_control_vector_apply(happy, il, (il > 5 && il < 10) ? 1.0 : 0.0);
}

If it is OK to make the breaking change of reading the .layer_count, then all the logic of combining multiple control vectors inside of `llama_control_vector_load()` via calls to `llama_control_vector_load_one()`, and passing data to `llama_control_vector_apply()`, becomes vastly simpler, as I can just use a `std::vector` and pass it as a `const int*` to the C function to do all the bookkeeping!

I'll try and get a PR working later today to show what I mean.

ngxson commented 2 weeks ago

Yeah, it does seem an amazingly complex way of doing something so simple though! :)

Using Python or a C++ linear algebra library I could implement all that in 5-10 lines, and even using GSL or MKL it would only be 30-40 lines of C!

Yeah, in fact it's because support for GPU in ggml was added at a later stage, so the new GPU-supported API is built with "backward compatibility" in mind. Btw if you want to discover ggml, I recommend having a look at this blog post: https://huggingface.co/blog/introduction-to-ggml

jukofyork commented 2 weeks ago

Yeah, it does seem an amazingly complex way of doing something so simple though! :) Using Python or a C++ linear algebra library I could implement all that in 5-10 lines, and even using GSL or MKL it would only be 30-40 lines of C!

Yeah, in fact it's because support for GPU in ggml was added at a later stage, so the new API is built with "backward compatibility" in mind. Btw if you want to discover ggml, I recommend having a look at this blog post: https://huggingface.co/blog/introduction-to-ggml

Thanks - that looks a lot more helpful than what I managed to find in the ggml repo!

My main goal now is just to get the low-rank transformations working though, as being able to do "abliteration" via control vectors will likely make many more people become aware of the other purposes; this post outlines the link:

The link between control vectors, "abliteration", Householder transformations and general affine transformations

I'm not actually that interested in the "abliteration" stuff, but hopefully more people might become interested in it (currently I've no idea who downloads my control vectors, but there seem to be about 3 people max who have given any feedback... lol).

jukofyork commented 2 weeks ago

This post also explains why I would need to implement the Moore–Penrose Inverse (or a least squares solver) for this:

https://github.com/jukofyork/control-vectors/discussions/2

and why it might be a good idea to completely lock down the algorithm before trying to implement it in ggml code...

jukofyork commented 2 weeks ago

IMO you can use an "internal" struct for the control vector and expose functions to create/manipulate it.

For example:

// internal cpp struct, don't need to define it in header file
struct llama_control_vector;

struct llama_control_vector * llama_control_vector_init(const char * file_path);
int32_t llama_control_vector_apply(struct llama_context * ctx, struct llama_control_vector * cvec, int il, float scale);

// example usage
struct llama_control_vector * happy = llama_control_vector_init("...");
for (int il = 0; il < llama_n_layer(model); il++) {
  // only apply if il > 5 and < 10
  llama_control_vector_apply(ctx, happy, il, (il > 5 && il < 10) ? 1.0 : 0.0);
}

(Have a look at the lora functions, it's the same idea)

Yeah, I just looked at this: the llama_lora_adapter struct seems to get around the problem by having a forward-declaration in llama.h, its implementation in llama.cpp, and this in common.h:

struct llama_lora_adapter_container : llama_lora_adapter_info {
    struct llama_lora_adapter * adapter;
};

It looks a bit convoluted, but I'll have a go at duplicating this setup for the control vectors.

slaren commented 2 weeks ago

build_graph_piter() is re-called for each iteration because the underlying memory may be re-allocated (@slaren explained it once but I can't find the comment anywhere).

Just to clear this up, I am not sure what I said, but it is absolutely possible to evaluate the same graph any number of times. In transformers the graph changes with each evaluation due to the positions and size of the context changing after every token, so we rebuild them every time, but if the graph does not change it is not necessary to build it again.
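
For what it's worth, a minimal sketch of that build-once/evaluate-many pattern with the CPU helpers (a sketch only; mem_size and n_threads here are arbitrary):

#include "ggml.h"
#include <math.h>
#include <string.h>

// Build the graph once, then evaluate it repeatedly: only the tensor
// contents change between iterations, not the graph structure.
void power_iterate(const float * a_data, float * v_data, int64_t n, int n_iter) {
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 256u*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
    struct ggml_tensor * v = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n);
    memcpy(a->data, a_data, ggml_nbytes(a));
    memcpy(v->data, v_data, ggml_nbytes(v));

    // w = a * v (ggml_mul_mat treats v as a one-column matrix)
    struct ggml_tensor * w = ggml_mul_mat(ctx, a, v);
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, w);

    for (int it = 0; it < n_iter; ++it) {
        ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 4);
        // normalize w on the CPU and write it back into v for the next pass
        const float * wd = (const float *) w->data;
        float norm = 0.0f;
        for (int64_t i = 0; i < n; ++i) norm += wd[i]*wd[i];
        norm = sqrtf(norm);
        float * vd = (float *) v->data;
        for (int64_t i = 0; i < n; ++i) vd[i] = wd[i]/norm;
    }
    memcpy(v_data, v->data, ggml_nbytes(v));
    ggml_free(ctx);
}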

@jukofyork prototyping with another library is probably a good idea. It shouldn't be too bad once you get used to it, but ggml certainly requires more boilerplate and may not be as flexible as other libraries.

jukofyork commented 2 weeks ago

Yeah, I think in this case the matrix-vector product dominates all the other computations, as it's O(n^2) with n=12k for the largest models.

I think if we could just wrap the matrix-vector product then everything else could be done cheaply on the CPU.

Also, since we have one of these to do for each layer, we could concatenate the columns of some or all of the (n, n) A_i matrices to create an (n, np) A' matrix, and then compute `A' v'` each iteration and extract the p different `v_i` sub-vectors from `v'`.

To compute truncated SVD using the algorithm from the 1987 paper I linked above, we would also need to compute `u' * A'` each iteration (in serial, as `u_t` is updated using `v_t`).

This would have the disadvantage of not being able to stop early when a numerical tolerance is reached, though...