ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Store KV cache of computed prompts to disk to avoid re-compute in follow-up runs #64

Closed: ggerganov closed this issue 1 year ago

ggerganov commented 1 year ago

Idea from: https://github.com/ggerganov/llama.cpp/issues/23#issuecomment-1465308592

We can add a --cache_prompt flag that, when set, dumps the KV cache computed during prompt processing to disk, in a file whose name is derived from a hash of the prompt. On the next run, we first check whether a stored KV cache exists for that hash and, if so, load it straight from disk instead of recomputing it.
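A rough sketch of how the lookup could work (the kv_cache_path / try_load_kv_cache helpers and the FNV-1a hash below are purely illustrative, not existing code):

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch: derive a cache filename from a hash of the prompt,
// then either load the saved KV cache or fall back to computing it.
static std::string kv_cache_path(const std::string & prompt) {
    // FNV-1a hash of the prompt text (any stable hash would do)
    uint64_t h = 1469598103934665603ULL;
    for (unsigned char c : prompt) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    char buf[32];
    snprintf(buf, sizeof(buf), "%016llx", (unsigned long long) h);
    return std::string("kv_cache_") + buf + ".bin";
}

// Returns true if a cached KV state was found and loaded into kv_data;
// on false the caller processes the prompt normally and saves the result.
static bool try_load_kv_cache(const std::string & prompt, std::vector<uint8_t> & kv_data) {
    FILE * f = fopen(kv_cache_path(prompt).c_str(), "rb");
    if (!f) {
        return false;
    }
    fseek(f, 0, SEEK_END);
    const long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    kv_data.resize(size);
    const bool ok = fread(kv_data.data(), 1, (size_t) size, f) == (size_t) size;
    fclose(f);
    return ok;
}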

Great task for contributing to the project!

Msa360 commented 1 year ago

@ggerganov Do you want to store k, v or K, V?

im-not-tom commented 1 year ago

Hello and sorry to bother you,

I made an attempt at doing this and would like to ask for a correction, as I have probably gravely misunderstood something.

Without dumping my entire code: I have two functions that basically boil down to memcpy-ing the contents of model.memory_[kv]->data in and out. My understanding is that this, along with saving/restoring n_past, should be enough to save the current state of the prompt.

Am I mistaken? I'm asking, of course, because the result of my attempt is a state that appears entirely random, often outputting nothing at all or garbage text.

// Copies the K/V memory tensors into a flat float buffer.
// savepoint.memory is assumed to be a std::vector<float> holding K followed by V.
bool gptj_make_savepoint(const struct gptj_model & model, gptj_savepoint & savepoint) {
    size_t nelements = ggml_nelements(model.memory_k);
    assert(nelements == ggml_nelements(model.memory_v));
    savepoint.memory.clear();
    assert(ggml_type_size(model.memory_k->type) == sizeof(float));
    assert(ggml_type_size(model.memory_v->type) == sizeof(float));
    savepoint.memory.resize(nelements * 2);
    memcpy(&savepoint.memory[0],         ggml_get_data(model.memory_k), sizeof(float) * nelements);
    memcpy(&savepoint.memory[nelements], ggml_get_data(model.memory_v), sizeof(float) * nelements);
    return true;
}

// Copies a previously saved buffer back into the model's K/V memory tensors.
bool gptj_apply_savepoint(const gptj_savepoint & savepoint, struct gptj_model & model) {
    size_t nelements = savepoint.memory.size() / 2;
    assert(nelements == ggml_nelements(model.memory_k));
    assert(nelements == ggml_nelements(model.memory_v));

    memcpy(ggml_get_data(model.memory_k), &savepoint.memory[0],         sizeof(float) * nelements);
    memcpy(ggml_get_data(model.memory_v), &savepoint.memory[nelements], sizeof(float) * nelements);

    return true;
}

setzer22 commented 1 year ago

@im-not-tom I got this working on my project and those are basically the steps I followed :+1:

I have verified this works by saving the memory, restoring it in a different process (after loading the model again), and then comparing the two logits vectors returned by llama_eval (the one from the original process and the one from the new process). If you do it correctly, the results should be identical.
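A sketch of that comparison step (assuming the logits from both runs have already been copied into std::vector<float>; the tolerance is only there for illustration, since a correct save/restore should reproduce the values exactly):

#include <cmath>
#include <cstdio>
#include <vector>

// Compares two logits vectors element-wise and reports the first mismatch.
static bool logits_match(const std::vector<float> & a, const std::vector<float> & b, float eps = 1e-6f) {
    if (a.size() != b.size()) {
        return false;
    }
    for (size_t i = 0; i < a.size(); ++i) {
        if (std::fabs(a[i] - b[i]) > eps) {
            fprintf(stderr, "mismatch at %zu: %f vs %f\n", i, a[i], b[i]);
            return false;
        }
    }
    return true;
}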

LostRuins commented 1 year ago

This requires a future input prompt to be identical to the original prompt, correct? So if I wanted to swap out a few words early in the text each time, it would still require reprocessing everything.

I'm struggling to understand why Hugging Face Transformers doesn't have this issue. Generating from a 500-word input prompt there seems to have the same latency as a 3-word prompt (constant time), whereas with llama.cpp the prompt processing time scales linearly with prompt length.

anzz1 commented 1 year ago

This is a great idea, as explained in my comment here: https://github.com/ggerganov/llama.cpp/issues/23#issuecomment-1477080912

I think the right way forward would be to separate the concepts of the immutable model and its mutable context into their own separate, simple structs.

Those could then be manipulated and evaluated against in whatever ways new features and implementations see fit: load/save from disk, mmap to memory, transfer over sockets, even from hardware devices like a serial port. Serve multiple instances by creating threads, or share the model and/or context between processes.

Currently I see the problem of having multiple concurrent PRs, each of which tries to implement new functionality by directly modifying the main program in its own way. A simple C API for accessing the 'model' and 'context' structs could keep the main program lean, clean and fast, while allowing all sorts of new functionality to be added as separate modules that interface with this API.

You've done an absolutely fantastic job making this minimal, fast and ultra-portable zero-dependency port of LLaMA. It's absolutely delightful in its current state, and I think modularity is the right approach moving forward, instead of growing the main program into a large monolith with a web of non-portable #ifdefs scattered everywhere. With every functionality-adding module living in its own .cpp file, independent of the others, any functionality could simply be added or left out via the makefile.

I could see this spawning a whole ecosystem: a new "modules" directory could be added to the repo root, and people could write whatever modules they like to add new functionality. Living inside the modules folder, separated from the main program and included via makefile options, they could also conform less strictly to the rules of zero dependencies and full platform compatibility. That would let people add new functionality without having to account for every platform, while letting users opt in to the features they want and have nothing forced upon them.

If a non-modular approach were taken, it would inevitably lead this marvelous, minimalistic codebase to grow into a large monolith and force people to fork the minimal version and cherry-pick the bugfixes and features they want in their own forks, creating a divergence which, in my view, would hurt the project in the long run.
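For illustration, a rough sketch of what such a split between an immutable model and a mutable context could look like behind a C API (all names and signatures below are hypothetical, not an existing interface):

// Illustrative only - not the actual llama.cpp API.
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

#ifdef __cplusplus
extern "C" {
#endif

struct llm_model;   // immutable weights, shareable between contexts
struct llm_context; // mutable state: KV cache, n_past, RNG, logits, ...

struct llm_model   * llm_model_load  (const char * path);
void                 llm_model_free  (struct llm_model * model);

struct llm_context * llm_context_new (struct llm_model * model, int n_ctx);
void                 llm_context_free(struct llm_context * ctx);

// Serialize/deserialize only the mutable state, e.g. to a file or a socket.
size_t llm_context_state_size(const struct llm_context * ctx);
size_t llm_context_save(const struct llm_context * ctx, uint8_t * dst, size_t dst_size);
bool   llm_context_load(struct llm_context * ctx, const uint8_t * src, size_t src_size);

#ifdef __cplusplus
}
#endif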

KerfuffleV2 commented 1 year ago

When implementing the feature, this comment may be useful: https://github.com/setzer22/llama-rs/issues/38#issuecomment-1477677482

TL;DR: memory_k and memory_v are compressible in proportion to how much of the context space is still unused. For example, with --ctx_size 512, saving the state after feeding in 1-2 tokens gives about a 95% reduction, feeding in 256 tokens gives around 50%, and feeding in 511 tokens gives virtually no compression; however, that last case is an edge case.

zstd compression at the lowest/fastest setting works well, and increasing the compression level doesn't do a lot. Since the memory is quite large (2GB with --ctx_size 2048), being able to save something like the Alpaca instruction prefix in around 30MB is a huge difference compared to having to load/save a massive 2GB file.

So I think it's really worth it to use some sort of lightweight compression scheme for at least the memory tensors when saving/restoring state.
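A minimal sketch of that approach using zstd's one-shot API at the fastest level (file I/O and error handling kept minimal; compress_kv / decompress_kv are illustrative names):

#include <zstd.h>
#include <cstdint>
#include <cstdio>
#include <vector>

// Compress a raw KV-cache buffer with zstd at level 1 (fastest);
// a mostly-empty cache compresses extremely well, as described above.
static std::vector<uint8_t> compress_kv(const std::vector<uint8_t> & raw) {
    std::vector<uint8_t> out(ZSTD_compressBound(raw.size()));
    const size_t n = ZSTD_compress(out.data(), out.size(), raw.data(), raw.size(), 1);
    if (ZSTD_isError(n)) {
        fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(n));
        return {};
    }
    out.resize(n);
    return out;
}

// The caller must remember the original (uncompressed) size, e.g. in a small header.
static std::vector<uint8_t> decompress_kv(const std::vector<uint8_t> & comp, size_t raw_size) {
    std::vector<uint8_t> out(raw_size);
    const size_t n = ZSTD_decompress(out.data(), out.size(), comp.data(), comp.size());
    if (ZSTD_isError(n) || n != raw_size) {
        return {};
    }
    return out;
}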

xaedes commented 1 year ago

I have implemented functions for getting and setting the rest of the model state. It additionally includes the random number generator state, logits, embedding and kv_cache.

Storing the logits was necessary so that we can eval tokens, save the state, restart the program, load the state and then sample. Otherwise sampling did not have access to the required logits and indeed segfaulted on the initially empty logits vector.

For completeness I also stored the embedding vector.

Because the whole state is not in one contiguous memory buffer, I decided on an output-pointer parameter for getting the state data. The user is responsible for allocating the memory that the state is written to.

https://github.com/xaedes/llama.cpp/commit/075f5f5bf081da91e28e8881f34e883fd4651272
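For reference, a usage sketch against the state get/set calls this introduces (llama_get_state_size / llama_copy_state_data; signature details may differ between versions, so treat this as an approximation):

#include "llama.h"
#include <cstdint>
#include <cstdio>
#include <vector>

// Ask the context how big its state is, let the caller allocate the buffer,
// copy the state into it and write it to disk.
static bool save_state_to_file(struct llama_context * ctx, const char * path) {
    const size_t n = llama_get_state_size(ctx);
    std::vector<uint8_t> buf(n);
    const size_t written = llama_copy_state_data(ctx, buf.data());

    FILE * f = fopen(path, "wb");
    if (!f) {
        return false;
    }
    const bool ok = fwrite(buf.data(), 1, written, f) == written;
    fclose(f);
    return ok;
}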

abetlen commented 1 year ago

@xaedes do you have a PR for this?

xaedes commented 1 year ago

Just created the pull request: https://github.com/ggerganov/llama.cpp/pull/1105

ejones commented 1 year ago

I believe #1169 covers this