ggerganov / llama.cpp

LLM inference in C/C++
MIT License

ggml-backend should avoid memset'ing brand new memory #7474

Closed jart closed 5 months ago

jart commented 5 months ago

If I use Phi-3 128k on CPU, the first thing GGML does is allocate ~40 GB of anonymous memory and then memset(0) it. This imposes a long startup delay, on the order of seconds.

Here's my feature request: operating systems always grant programs zero-initialized memory, so the memset() can be skipped the first time a fresh CPU allocation is used. Most calloc() implementations are smart enough to avoid the initial zeroing cost by tracking which memory hasn't been handed out yet. It would be great if ggml-backend could do this too.
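As a minimal sketch of the trick being described (not llama.cpp code; the `cpu_buffer` type and its flag are hypothetical), anonymous pages from the OS arrive zero-filled, so the first clear of a fresh allocation is redundant:

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

struct cpu_buffer {
    void * data;
    size_t size;
    bool   zeroed; // hypothetical flag: contents are still all-zero
};

static struct cpu_buffer cpu_buffer_new(size_t size) {
    // for large sizes, calloc() typically obtains fresh zero-filled pages
    // from the OS and skips any explicit clearing pass
    struct cpu_buffer buf = { calloc(1, size), size, true };
    return buf;
}

static void cpu_buffer_clear(struct cpu_buffer * buf) {
    if (buf->zeroed) {
        return; // brand-new memory is already zero; no memset needed
    }
    memset(buf->data, 0, buf->size);
    buf->zeroed = true;
}
```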


Side question: why does GGML implement its own memory allocators for CPU? I tried writing a pull request to fix this issue, but I find ggml-alloc.c difficult to understand. I would like to see GGML naively use malloc(), calloc(), and free() for each logical allocation, and only do clever things for performance-critical cases like op wdata.

ggerganov commented 5 months ago

I don't think we are memsetting the memory of the weights. The only memset is done for the KV cache buffers through ggml_backend_buffer_clear():

https://github.com/ggerganov/llama.cpp/blob/fbf777d2b9c30e7569e3d1c149501c1e31d9b5b9/llama.cpp#L2458-L2470

This is done in a backend-agnostic way in order to be able to zero-initialize the GPU buffers as well, and it happens after the allocation process so that it is optional - not all buffers need to be zeroed.
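The pattern here can be sketched roughly as follows (illustrative names, not the actual ggml-backend interface): each backend supplies its own clear implementation behind a function pointer, so one call zeroes a CPU buffer with memset and a GPU buffer with a device-side fill.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct backend_buffer;

struct backend_buffer_iface {
    void (*clear)(struct backend_buffer * buf, uint8_t value);
};

struct backend_buffer {
    const struct backend_buffer_iface * iface;
    void * data; // host pointer for CPU, device handle for GPU backends
    size_t size;
};

static void cpu_iface_clear(struct backend_buffer * buf, uint8_t value) {
    memset(buf->data, value, buf->size); // CPU backend: plain memset
}

static void backend_buffer_clear(struct backend_buffer * buf, uint8_t value) {
    buf->iface->clear(buf, value); // dispatch to the owning backend
}
```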

Maybe you are allocating the full 128k context, which makes the KV cache quite big? If you really need the entire 128k-token context, then the startup time would be negligible compared to the full processing time for such a big context. If you don't need the entire context, then you can reduce the KV cache size and the zeroing again becomes negligible.
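For reference, llama.cpp's example programs take the context size via `-c` / `--ctx-size`, so requesting e.g. 4096 tokens instead of the full 131072 shrinks the KV cache proportionally. A back-of-the-envelope calculation shows why the full context is so costly; the layer count, K/V width, and element type below are assumptions for a Phi-3-mini-class model chosen to show the order of magnitude, not values read from the actual GGUF:

```c
#include <stdio.h>

int main(void) {
    const long long n_layer   = 32;     // assumed transformer layer count
    const long long n_ctx     = 131072; // the full 128k context
    const long long n_embd_kv = 3072;   // assumed K/V width per layer
    const long long elt_size  = 2;      // f16 elements

    // K and V each hold n_ctx * n_embd_kv elements per layer
    long long bytes = 2 * n_layer * n_ctx * n_embd_kv * elt_size;
    printf("KV cache: %.1f GiB\n", bytes / (1024.0 * 1024.0 * 1024.0));
    // prints ~48 GiB; with n_ctx = 4096 the same formula gives 1.5 GiB
    return 0;
}
```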

Why does GGML implement its own memory allocators for CPU?

It was one of the very first features of ggml, to avoid memory allocations at runtime - everything (weights, intermediate tensors, work buffers, etc.) went into a single pre-allocated memory buffer whose size had to be manually estimated to fit the worst-case scenario. With ggml-alloc this is now much more generic and the necessary buffer size is estimated automatically based on the graphs that will be computed.
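The core idea can be sketched briefly (illustrative code, not ggml-alloc's actual implementation): dry-run a topologically ordered compute graph, grow a "live bytes" counter when a node's output is allocated, shrink it after the node's last consumer runs, and pre-allocate one buffer of the recorded peak size.

```c
#include <stddef.h>

struct node {
    size_t size;     // bytes for this node's output tensor
    int    last_use; // index of the last node reading the output;
                     // use n (or more) to keep a graph output alive
};

static size_t plan_worst_case(const struct node * nodes, int n) {
    size_t live = 0, peak = 0;
    for (int i = 0; i < n; i++) {
        live += nodes[i].size;
        if (live > peak) {
            peak = live;
        }
        for (int j = 0; j <= i; j++) {
            if (nodes[j].last_use == i) {
                live -= nodes[j].size; // space becomes reusable
            }
        }
    }
    return peak; // worst-case size of the single shared buffer
}
```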

I am not sure what the real value of avoiding dynamic memory allocations is, and it might very well be a net negative in the grand scheme, since we often have to do some acrobatics to achieve basic things. But it's something nested in the core of the project and I don't plan to change it.

jart commented 5 months ago

Thank you for taking the time to respond thoughtfully to my concerns. Avoiding malloc() the way you originally did in GGML is generally a good guiding principle for designing C libraries that need to be usable in contexts like signal handlers, where malloc() isn't allowed. There's also a great deal of code in C libraries and kernels that sits upstream of malloc() and obviously can't use it. I think GGML has reached a level of sophistication these days where applying it to those sorts of innermost, gutsy applications isn't practical, so I'd love to see a refactoring / simplification of the dynamic memory code happen one day, should you reconsider. However, I can't realistically expect any more from you here than you've already given. Thank you.