ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Control vector loading fixes #8137

Closed: jukofyork closed this 1 day ago

slaren commented 2 days ago

There are still ggml and gguf context leaks in some cases. It has been a while since I took a look at #6289, but I think the changes there are correct and it would be better to use it as the baseline.

jukofyork commented 2 days ago

There are still ggml and gguf context leaks in some cases. It has been a while since I took a look at #6289, but I think the changes there are correct and it would be better to use it as the baseline.

Yeah, sorry, I just pushed what I had so far, but I'm busy incorporating those fixes now :)

jukofyork commented 2 days ago

I've completely refactored llama_control_vector_load_one() to remove the unnecessary two passes and to introduce a single point of return, which should lessen the likelihood of leaks in the future.

Can you have a quick look over it and see if there is anything I've done wrong:

static llama_control_vector_data llama_control_vector_load_one(const llama_control_vector_load_info & load_info) {
    llama_control_vector_data result = { -1, {} };

    ggml_context * ctx = nullptr;
    struct gguf_init_params meta_gguf_params = {
        /* .no_alloc = */ false,
        /* .ctx      = */ &ctx,
    };
    struct gguf_context * ctx_gguf = gguf_init_from_file(load_info.fname.c_str(), meta_gguf_params);
    if (!ctx_gguf) {
        fprintf(stderr, "%s: failed to load control vector from %s\n", __func__, load_info.fname.c_str());
        ggml_free(ctx);
        return result;
    }

    int32_t n_tensors = gguf_get_n_tensors(ctx_gguf);
    if (n_tensors == 0) {
        fprintf(stderr, "%s: no direction tensors found in %s\n", __func__, load_info.fname.c_str());
    }

    for (int i = 0; i < n_tensors; i++) {
        std::string name = gguf_get_tensor_name(ctx_gguf, i);

        int layer_idx = -1;

        // split on '.'
        size_t dotpos = name.find('.');
        if (dotpos != std::string::npos && name.substr(0, dotpos) == "direction") {
            try {
                layer_idx = std::stoi(name.substr(dotpos + 1));
            } catch (...) {
                layer_idx = -1;
            }
        }
        if (layer_idx < 0) {
            fprintf(stderr, "%s: invalid/unparsable direction tensor layer index in %s\n", __func__, load_info.fname.c_str());
            continue;
        } else if (layer_idx == 0) {
            fprintf(stderr, "%s: invalid (zero) direction tensor layer index in %s\n", __func__, load_info.fname.c_str());
            continue;
        }

        struct ggml_tensor * tensor = ggml_get_tensor(ctx, name.c_str());
        if (tensor->type != GGML_TYPE_F32) {
            fprintf(stderr, "%s: invalid (non-F32) direction tensor type in %s\n", __func__, load_info.fname.c_str());
            continue;
        }
        if (ggml_n_dims(tensor) != 1) {
            fprintf(stderr, "%s: invalid (non-1D) direction tensor shape in %s\n", __func__, load_info.fname.c_str());
            continue;
        }

        if (result.n_embd == -1) {
            result.n_embd = ggml_nelements(tensor);
        } else if (ggml_nelements(tensor) != result.n_embd) {
            fprintf(stderr, "%s: direction tensor in %s does not match previous dimensions\n", __func__, load_info.fname.c_str());
            continue;
        }

        // extend if necessary - do not store data for layer 0 (it's not used)
        result.data.resize(std::max(result.data.size(), static_cast<size_t>(result.n_embd * layer_idx)), 0.0f);

        const float * src = (const float *) tensor->data;
        float * dst = result.data.data() + result.n_embd * (layer_idx - 1);  // layer 1 at [0]
        for (int j = 0; j < result.n_embd; j++) {
            dst[j] = src[j] * load_info.strength;
        }

    }

    gguf_free(ctx_gguf);
    ggml_free(ctx);

    return result;
}

I've tried to stick with "continue" instead of break or a hard exit where possible for both functions now, and tried to make the errors a little more informative about using layer 0, etc.

I've tested it very briefly on Mistral using 3 sets of control vectors and it appears to work OK, but I will double-check again later after some food when I'm feeling fresher, so don't merge yet! :)

jukofyork commented 2 days ago

I've tested it some more using Mistral with 4 sets of control vectors, and some extra debugging output shows that it's correctly resizing the .data vector in both locations.

The failed test looks to be something totally unrelated: a missing "common.h" header.


Two final things to think about before merging:

  1. Should we allow multiple direction vectors for the same layer in the same file?

IMO this should be allowed, with the vectors simply summed, as we already do for separate files. It also ties in with my intention of adding "control projectors" next: those will be combined using V.V^T instead of summation (see the sketch after this list), since it likely makes sense to perform the orthogonal projection using a subspace of rank 2 or more. Allowing multiple directions in the "control vectors" file is redundant, but I don't think it does any harm.

If it's decided against having this, then I think the code should probably warn the user, as currently it just overwrites the old layer's data...

If we want to go with this change, then I think the only change needed is this:

dst[j] += src[j] * load_info.strength;

as the result.data.resize(..., 0.0f) call above will already have initialized the data to zeros.

  2. Can we change the name of llama.cpp::llama_control_vector_apply() to something more appropriate? I think possibly llama_control_vector_set might be the best choice as this is ultimately what it is doing:
    for (size_t il = 1; il < model.hparams.n_layer; il++) {
        assert(cvec.tensors[il] != nullptr);

        const size_t off = n_embd * (il - 1); // buffer doesn't have data for layer 0, since it's never present
        if (off + n_embd <= len) {
            ggml_backend_tensor_set(cvec.tensors[il], data + off, 0, n_embd * ggml_element_size(cvec.tensors[il]));
        }
    }
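
To spell out the V.V^T combination mentioned in point 1 (a sketch of the intended maths, under my assumption that the directions v_1, ..., v_k are stacked as orthonormal columns of V):

$$
V V^\top \;=\; \sum_{i=1}^{k} v_i v_i^\top
$$

which is the orthogonal projection onto the subspace spanned by the directions; with k = 1 it reduces to the single-direction case.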
jukofyork commented 2 days ago

I pushed this change to see if it will clear the failed tests:

dst[j] += src[j] * load_info.strength;
slaren commented 2 days ago

The code looks good to me. I think it would be better to change the continue to break and abort the process in case of error; it seems very unlikely that the result will be correct if there are errors in the control vector file, so it is better to report the error and abort. That's the way we handle models in llama.cpp, e.g. a model that has all the expected tensors will still fail to load if it also has some additional unused ones.

1. Should we allow multiple direction vectors for the same layer in the same file?

I don't really have an opinion about this, it seems pointless if they are just going to be summed in the end, but if there are other cases where this separation may be necessary then I guess it is ok.

2. Can we change the name of llama.cpp::llama_control_vector_apply() to something more appropriate? I think possibly llama_control_vector_set might be the best choice as this is ultimately what it is doing:

I agree that llama_control_vector_set could be a better name, but I don't think it is a good idea to make a breaking API change for this reason alone. It might make more sense to do that if it is part of a larger refactor of the implementation.

jukofyork commented 2 days ago

The code looks good to me. I think it would be better to change the continue to break and abort the process in case of error; it seems very unlikely that the result will be correct if there are errors in the control vector file, so it is better to report the error and abort. That's the way we handle models in llama.cpp, e.g. a model that has all the expected tensors will still fail to load if it also has some additional unused ones.

I've made both functions abort loading any more control vectors and also clear the .data member on abort. This should hopefully avoid the case that confused me (as a user) of some control vector files being loaded but not others:

//
// Control vector utils
//

static llama_control_vector_data llama_control_vector_load_one(const llama_control_vector_load_info & load_info) {
    llama_control_vector_data result = { -1, {} };

    ggml_context * ctx = nullptr;
    struct gguf_init_params meta_gguf_params = {
        /* .no_alloc = */ false,
        /* .ctx      = */ &ctx,
    };
    struct gguf_context * ctx_gguf = gguf_init_from_file(load_info.fname.c_str(), meta_gguf_params);
    if (!ctx_gguf) {
        fprintf(stderr, "%s: failed to load control vector file from %s\n", __func__, load_info.fname.c_str());
        ggml_free(ctx);
        return result;
    }

    int32_t n_tensors = gguf_get_n_tensors(ctx_gguf);
    if (n_tensors == 0) {
        fprintf(stderr, "%s: no direction tensors found in %s\n", __func__, load_info.fname.c_str());
    }

    for (int i = 0; i < n_tensors; i++) {
        std::string name = gguf_get_tensor_name(ctx_gguf, i);

        int layer_idx = -1;

        // split on '.'
        size_t dotpos = name.find('.');
        if (dotpos != std::string::npos && name.substr(0, dotpos) == "direction") {
            try {
                layer_idx = std::stoi(name.substr(dotpos + 1));
            } catch (...) {
                layer_idx = -1;
            }
        }
        if (layer_idx < 0) {
            fprintf(stderr, "%s: invalid/unparsable direction tensor layer index in %s\n", __func__, load_info.fname.c_str());
            result.n_embd = -1;
            break;
        } else if (layer_idx == 0) {
            fprintf(stderr, "%s: invalid (zero) direction tensor layer index in %s\n", __func__, load_info.fname.c_str());
            result.n_embd = -1;
            break;
        }

        struct ggml_tensor * tensor = ggml_get_tensor(ctx, name.c_str());
        if (tensor->type != GGML_TYPE_F32) {
            fprintf(stderr, "%s: invalid (non-F32) direction tensor type in %s\n", __func__, load_info.fname.c_str());
            result.n_embd = -1;
            break;
        }
        if (ggml_n_dims(tensor) != 1) {
            fprintf(stderr, "%s: invalid (non-1D) direction tensor shape in %s\n", __func__, load_info.fname.c_str());
            result.n_embd = -1;
            break;
        }

        if (result.n_embd == -1) {
            result.n_embd = ggml_nelements(tensor);
        } else if (ggml_nelements(tensor) != result.n_embd) {
            fprintf(stderr, "%s: direction tensor in %s does not match previous dimensions\n", __func__, load_info.fname.c_str());
            result.n_embd = -1;
            break;
        }

        // extend if necessary - do not store data for layer 0 (it's not used)
        result.data.resize(std::max(result.data.size(), static_cast<size_t>(result.n_embd * layer_idx)), 0.0f);

        const float * src = (const float *) tensor->data;
        float * dst = result.data.data() + result.n_embd * (layer_idx - 1);  // layer 1 at [0]
        for (int j = 0; j < result.n_embd; j++) {
            dst[j] += src[j] * load_info.strength;  // allows multiple directions for same layer in same file
        }

    }

    if (result.n_embd == -1) {
        fprintf(stderr, "%s: skipping %s due to invalid direction tensors\n", __func__, load_info.fname.c_str());
        result.data.clear();
    }

    gguf_free(ctx_gguf);
    ggml_free(ctx);

    return result;
}

llama_control_vector_data llama_control_vector_load(const std::vector<llama_control_vector_load_info> & load_infos) {
    llama_control_vector_data result = { -1, {} };

    for (const auto & info : load_infos) {
        auto cur = llama_control_vector_load_one(info);

        if (cur.n_embd == -1) {
            result.n_embd = -1;
            break;
        }
        if (result.n_embd != -1 && result.n_embd != cur.n_embd) {
            fprintf(stderr, "%s: control vectors in %s does not match previous dimensions\n", __func__, info.fname.c_str());
            result.n_embd = -1;
            break;
        }

        if (result.n_embd == -1) {
            result = std::move(cur);
        } else {
            result.data.resize(std::max(result.data.size(), cur.data.size()), 0.0f);  // extend if necessary
            for (size_t i = 0; i < cur.data.size(); i++) {
                result.data[i] += cur.data[i];
            }
        }
    }

    if (result.n_embd == -1) {
        fprintf(stderr, "%s: no valid control vector files passed\n", __func__);
        result.data.clear();
    }

    return result;
}
  1. Should we allow multiple direction vectors for the same layer in the same file?

I don't really have an opinion about this, it seems pointless if they are just going to be summed in the end, but if there are other cases where this separation may be necessary then I guess it is ok.

It was more a question of whether it should trigger an error, but since the directions are just getting summed I think it is probably valid/OK to have.

  2. Can we change the name of llama.cpp::llama_control_vector_apply() to something more appropriate? I think possibly llama_control_vector_set might be the best choice as this is ultimately what it is doing:

I agree that llama_control_vector_set could be a better name, but I don't think it is a good idea to make a breaking API change for this reason alone. It might make more sense to do that if it is part of a larger refactor of the implementation.

I'll leave this off for now then, but my next task after this is to look at introducing the "control projectors", and I may look at refactoring that then.


I actually think the llama_control_vector_data struct would have been better off using a std::map<int, std::vector<float>> for the .data member, and I will likely have to use a std::map<int, std::vector<std::vector<float>>> for the "control projectors" to keep the multiple vectors per layer separate for the V.V^T (or equivalently v_1.v_1^T + v_2.v_2^T + ...) calculation.
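
Roughly what I have in mind is the sketch below (the names are hypothetical and not part of this PR):

#include <map>
#include <vector>

// Hypothetical sketch of a map-based layout, keyed by layer index so that
// layers can be sparse and layer 0 simply never appears.
struct llama_control_vector_data_map {
    int n_embd = -1;

    // one summed direction per layer for the control vectors ...
    std::map<int, std::vector<float>> directions;

    // ... and possibly several directions per layer for the "control projectors",
    // kept separate so that v_1.v_1^T + v_2.v_2^T + ... can be formed later
    std::map<int, std::vector<std::vector<float>>> projector_directions;
};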

I know @ggerganov is very anti-template due to the increased compile times, so just checking if it is likely to be OK to use a std::map like this? I think the overhead will be minimal (each vector is likely 4-8k contiguous 4-byte floats) and the non-contiguous memory layout isn't much of a factor either (i.e. only a couple of MB in total, and it is likely just used to initialise a GGML tensor further down the chain anyway).

I will leave all of this for the next PR anyway. I think this PR is done now, unless you can see anything else?

slaren commented 1 day ago

I know @ggerganov is very anti-template due to the increased compile times, so just checking if it is likely to be OK to use a std::map like this?

std::map and other templates from the C++ standard library are used everywhere in llama.cpp; it is OK to use them.

jukofyork commented 1 day ago

I know @ggerganov is very anti-template due to the increased compile times, so just checking if it is likely to be OK to use a std::map like this?

std::map and other templates from the C++ standard library are used everywhere in llama.cpp; it is OK to use them.

I think this may not be such a good idea after all:

    LLAMA_API int32_t llama_control_vector_apply(
            struct llama_context * lctx,
                     std::map<int, std::pair<std::vector<float>, std::vector<std::vector<float>>>> data,
                         int32_t   n_embd,
                         int32_t   il_start,
                         int32_t   il_end);

:grin:

I think to start with I will just allow a single "control projector" per layer and think later how to do this without the map of doom...

If I keep 2 separate scale factors (called strength in the code) then I can actually use the same control vectors for both purposes, and the changes will be very minimal right up until the ggml_matmul(), etc. calls are needed.
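
Concretely, something along these lines in the load info struct would do it (projector_strength is a hypothetical name; strength and fname are the fields already used above):

#include <string>

// Sketch only: a second, hypothetical scale factor so that one control vector
// file can drive both the additive term and the projection term.
struct llama_control_vector_load_info {
    float       strength;            // existing additive scale (load_info.strength above)
    float       projector_strength;  // hypothetical: scale applied to the V.V^T term
    std::string fname;
};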

I think I'll leave it a couple of days and come back fresh though, so I will look at this again next week.

slaren commented 1 day ago

Ah yes, the llama.cpp public API is strictly a C API, so if this function is part of the llama.cpp API it cannot take a C++ class as a parameter.

jukofyork commented 1 day ago

Ah yes, the llama.cpp public API is strictly a C API, so if this function is part of the llama.cpp API it cannot take a C++ class as a parameter.

Yeah, I don't think trying to use a map was a good idea even if it weren't part of the public API. I think I should be able to get this working very easily for a maximum of 1 projected direction, but I might need some help again by the time I get to the ggml calls (likely early next week now).
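
For a maximum of one projected direction per layer, a flat C-friendly shape along these lines should be enough (a sketch only, in the same style as the declaration above, with a hypothetical name rather than a proposed llama.h declaration):

    // Hypothetical sketch: mirrors the packed layout used for the control vector
    // buffer above (layer 1 at offset 0, layer 0 never stored), with at most one
    // projected direction per layer.
    LLAMA_API int32_t llama_control_projector_apply(
            struct llama_context * lctx,
                     const float * data,      // n_embd floats per layer, packed contiguously
                          size_t   len,       // total number of floats in data
                         int32_t   n_embd,
                         int32_t   il_start,
                         int32_t   il_end);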