ggerganov / llama.cpp

LLM inference in C/C++
MIT License

llama : refactor the llm.build_xxx functions #5239

Open ggerganov opened 7 months ago

ggerganov commented 7 months ago

Now that we support a large number of architectures, we can clearly see the patterns that emerge when constructing the compute graphs - e.g. optional biases, different norm types, fused QKV vs separate Q+K+V, etc.

We should deduplicate the copy-paste portions in functions such as llm.build_llama(), llm.build_falcon(), etc.
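For illustration, the repeated per-layer structure in these functions looks roughly like the outline below. This is a simplified sketch only - build_norm, build_qkv, build_attn and build_ffn are hypothetical placeholder names, not the actual code:

// hypothetical outline of the per-layer pattern repeated across
// llm.build_llama(), llm.build_falcon(), etc. (helper names are placeholders)
for (int il = 0; il < n_layer; ++il) {
    // attention norm: RMS norm vs layer norm, bias optional per architecture
    cur = build_norm(inpL, model.layers[il].attn_norm, model.layers[il].attn_norm_b);

    // QKV projection: one fused wqkv matmul or separate wq/wk/wv, biases optional
    cur = build_qkv(cur, model.layers[il]);

    // positional encoding (RoPE/ALiBi), KV cache update, attention, output projection
    cur = build_attn(cur, ...);

    // first residual
    struct ggml_tensor * ffn_inp = ggml_add(ctx0, cur, inpL);

    // FFN norm + FFN, again with per-architecture norm type and activation
    cur = build_ffn(ffn_inp, model.layers[il]);

    // second residual, becomes the input of the next layer
    inpL = ggml_add(ctx0, cur, ffn_inp);
}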

The advantage of the current code is that it is easy to look into the graph of a specific architecture. When we refactor this, we will lose this convenience to some extent, so we should approach the refactoring in a way that doesn't completely obscure which parts of the graph belong to which architectures.

Open to ideas and suggestions on how best to do this.

ngxson commented 7 months ago

While trying to implement the "steering vector" (without success), I came across these build_* functions and spent some time playing around with them. I'd like to share my idea, purely from the perspective of a programmer (I'm not very familiar with machine learning; I only understand some basic concepts).

This idea is in fact borrowed from TensorFlow, where one layer can be described in a single line of code, something like:

tf.keras.layers.Dense(units=64, activation='relu')

This can be implemented by creating a new struct called llama_cgraph_builder, and the syntax may look like:

llama_cgraph_builder builder(ctx0, lctx, hparams, model, cb);
inpL = builder.inp_embd(...);

// the line above replaces "llm_build_inp_embd"
// for now, the usage of "llm_build_inp_embd" is the same for all model architectures,
// so I'm not sure what should be passed as arguments to builder.inp_embd()

inp_pos = builder.inp_pos(...);
KQ_mask = builder.KQ_mask(...);
for (int il = 0; il < n_layer; ++il) {
    builder.set_layer(il);
    ...
}

By doing so, the builder holds all the required info (ctx0, lctx, hparams, model, cb) and passes it along to each call, for example:

struct llama_cgraph_builder {
    struct ggml_context * ctx0;
    llama_hparams hparams;
    ... // also lctx, model, batch, cb, etc.
    struct ggml_tensor * inp_embd() {
        // wraps the existing free function, using the members captured above
        struct ggml_tensor * tensor = llm_build_inp_embd(ctx0, hparams, batch, model.tok_embd, lctx.inp_tokens, lctx.inp_embd, cb);
        cb(tensor, "inp_embd", -1);
        return tensor;
    }
};
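A per-layer method could then absorb the per-architecture variations mentioned in the issue (optional biases, fused vs separate QKV). Below is a hypothetical sketch only; member names like wqkv/bqkv are illustrative:

// hypothetical per-layer helper inside llama_cgraph_builder - just a sketch
struct ggml_tensor * qkv(struct ggml_tensor * cur) {
    // "il" would have been stored earlier by set_layer(il)
    const auto & layer = model.layers[il];
    if (layer.wqkv) {
        // fused QKV projection with optional bias
        cur = ggml_mul_mat(ctx0, layer.wqkv, cur);
        if (layer.bqkv) {
            cur = ggml_add(ctx0, cur, layer.bqkv);
        }
    } else {
        // separate Q/K/V projections would be handled here
        ...
    }
    cb(cur, "wqkv", il);
    return cur;
}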

However, I don't know whether by doing so we lose some important visibility into the architecture. The only thing I cannot be sure about is what should be exposed as arguments in each of these "smaller functions" inside the builder.


Another idea is to add the smaller functions described above into struct llm_build_context, so there is no need to create another struct.
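In that case, a rough sketch could look like the following (assuming llm_build_context already holds ctx0, lctx, hparams, model, batch and cb as members - purely a hypothetical illustration):

struct llm_build_context {
    ... // existing members (ctx0, lctx, hparams, model, batch, cb, ...) stay as they are

    // the same helper as above, added directly as a member function
    struct ggml_tensor * inp_embd() {
        struct ggml_tensor * tensor = llm_build_inp_embd(ctx0, hparams, batch, model.tok_embd, lctx.inp_tokens, lctx.inp_embd, cb);
        cb(tensor, "inp_embd", -1);
        return tensor;
    }

    // build_llama(), build_falcon(), ... would then call inp_embd(), qkv(), etc. directly
};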

github-actions[bot] commented 6 months ago

This issue is stale because it has been open for 30 days with no activity.