ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Can't get server to run with more than 6 slots #5247

Closed: vashat closed this issue 4 months ago

vashat commented 6 months ago

Hi!

Trying to run the server with more slots than 6 by setting the parameters -np and -cb like this:

./server -m models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf --host 0.0.0.0 --ctx_size 8192 -t 8 -ngl 99 -np 10 -cb

I set -np 10 and the output confirms that I have 10 slots, but when I make 10 parallel requests only 6 of them produce output; the rest are queued and only start producing output once the first 6 slots are done. Is there any way to get output on all 10 slots at the same time, or is there a hard limit of 6 slots, with anything above -np 6 ignored?

pudepiedj commented 6 months ago

I can open 10 simultaneous connections to the server over my home network using the IP address of the machine where it is running, and although I can't tell whether there would be a delay if I were ten different people using it simultaneously, it all seems to work and the logs show 10 simultaneous users being served.

How are you running this test - using Python API calls, browser connections, or something else? This is with mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf; I can't quite run the Q5 on my hardware.

vashat commented 6 months ago

@pudepiedj I'm running the test using the web interface that comes with the server (at the web root of the API). I open 10 Chrome tabs with the web interface, fill each with a prompt and then submit the prompts in quick succession in every tab. The first 6 tabs start streaming their answers right away, but the last 4 wait until one of the first 6 is finished before they start streaming. Whatever value I set for the -np parameter above 5, the result is the same: only the first 6 slots start streaming initially. It doesn't matter which size or type of model I use, it always behaves the same, so you should be able to reproduce it with a smaller model.

pudepiedj commented 6 months ago

> @pudepiedj I'm running the test using the web interface that comes with the server (at the web root of the API). I open 10 Chrome tabs with the web interface, fill each with a prompt and then submit the prompts in quick succession in every tab. The first 6 tabs start streaming their answers right away, but the last 4 wait until one of the first 6 is finished before they start streaming. Whatever value I set for the -np parameter above 5, the result is the same: only the first 6 slots start streaming initially. It doesn't matter which size or type of model I use, it always behaves the same, so you should be able to reproduce it with a smaller model.

@vashat I am afraid that I am unable to reproduce this behaviour. Running mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf on an M2 Max with 32GB and -np 10, everything seems to work correctly when I run a Python requests.post() loop from another machine over my internal network. Here is a sample from the server console logging:

slot 5 is processing [task id: 315]
slot 5 : kv cache rm - [0, end)
slot 2 is processing [task id: 316]
slot 9 is processing [task id: 317]
slot 6 is processing [task id: 318]
slot 1 is processing [task id: 319]
slot 4 is processing [task id: 320]
slot 8 is processing [task id: 321]
slot 0 is processing [task id: 322]
slot 7 is processing [task id: 323]
slot 3 is processing [task id: 324]

I hit resource problems with more than -np 10 and -c 4096 although I have been able to run -np 30 on -c 4096 successfully provided I start with a clean installation after rebooting the machine. It doesn't recover from failures at all well. Of course, running 30 users in 4096 means each only gets a cache of about 135, so the performance isn't good, but it does run. Here's the server setup for the original example in full:

make -j && ./bin/server -m ../models/Mixtral-8x7b/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf -ngl 10 -c 4096 -np 10 --host 0.0.0.0

The model does some very strange things, inventing questions and answers, but that's a separate issue. Please let me know if you think something in the way I have set this up has somehow bypassed your problem. (I did try opening 10 tabs in Chrome, but the result was the same: no issues.) My main discovery is that server.cpp is pretty fragile and crashes the machine when I try anything ambitious, but I suspect that has more to do with setting iogpu.wired_limit_mb=27500 than with the server code.

vashat commented 6 months ago

https://github.com/ggerganov/llama.cpp/assets/1157672/ff6bdc9f-036a-4500-a8d5-dc5b32323488

@pudepiedj I have added a video that demonstrates the problem; the output from the server is also visible in the video in case you can see what goes wrong. My command is:

./server -m models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf --host 0.0.0.0 --ctx_size 8192 -t 8 -ngl 99 -np 10 -cb

I also tried your parameters, but the behavior was even weirder without the -cb parameter. I also tried -ngl 0 to run on the CPU only, but there was no difference in behavior.

pudepiedj commented 6 months ago

@vashat It is possible that we are dealing with nothing more obscure than the system maxing out. If I run mixtral-8x7b-v0.1.Q3_K_M.gguf and send a sequence of completion requests from a Python script, the available memory and the GPU both max out long before all the -np 16 users are serviced. There is clearly much that can slow things down other than a bug. It might be worth trying with a very small model. Sorry not to be of more help; I think someone who understands the server better than I do will need to pick this up.

vashat commented 6 months ago

@pudepiedj I've tried a model a tenth of the size, but the problem is the same: slots after the first 6 always have to wait. I have tried both on a local machine (Apple M3 Max, 48GB) compiled with Metal and on AWS with llama.cpp compiled with cuBLAS support. I have tried to find some hard limit in the server code but haven't succeeded yet. Thank you for trying to reproduce the problem; I will continue digging in the server code to try to understand what is going on.

pudepiedj commented 6 months ago

@vashat If I set -np 16 and -c 4096 then the kvcache for each slot is only 256 (4096/16), and it says it has set up all 16 slots, but sending 25 queued /completion requests from a Python script still only triggers 10 slots to start processing, and there is a lot of shifting during processing as half of each cache is periodically dumped. Your kvcaches are 819 per slot, so if you have 48GB you could try increasing the usable VRAM to something like 40GB using

sudo sysctl iogpu.wired_limit_mb=40000

(it'll ask for your password) and setting -c 16384, or as much as remains within the VRAM limit, to give caches of around 1638, and see if that makes any difference. But on my M2 Max the GPU is completely maxed out, and I don't think the M3 GPU is much more powerful, so if that's happening for you it may make no difference. But if a disproportionate amount of the GPU work is being used for cache shifting, making the caches larger should help.

ggerganov commented 6 months ago

Hm, this is unexpected - looks like a bug in the server logic

pudepiedj commented 6 months ago

@ggerganov I have been looking at this behaviour more carefully and have now replicated @vashat's results, albeit with different numbers. I think the following are all true of it, even if on my system the server seems to run 10 processes where on @vashat's it runs 6. The evidence for all this is a small utility based on the way Python's tqdm module displays progress using ANSI characters; the screenshots below are taken from this process. The driver was a Python script that threw 40 queued POST requests at the /completion endpoint. It did in the end run through them all and terminate, but very slowly, and obviously one wouldn't want to use the graphics in a working server, but it at least helped me to see what was happening.

This is the server initiation I used. I only used the Q3_K_M model to conserve resources and because the actual answers were of little interest in this experiment:

% make -j && ./bin/server -m ../models/Mixtral-8x7b/mixtral-8x7b-v0.1.Q3_K_M.gguf -ngl 10 -c 4096 -np 32

  1. At inception the server runs the process in Slot 0 to completion. It then runs 1-10, pretends to run 11, and when 1-10 finish it runs 11-20, and so on. At the end it pretends to start process 31 and others in the 0-10 range, but they only really run once 21-30 have all terminated.

    [screenshot]
  2. Like @vashat I can find nothing in the code to explain why the server seems to have a preference for running 10 slots at a time even when others are available.

    [screenshot]
  3. In each tranche there appears to be a logjam where one process, usually the one with the highest number, fills the kvcache but does not perform a context shift. Thus Slot 10 will reach 128/128 and remain there while the others are performing context shifts. It does eventually terminate, but usually last. Here it is slot 17.

    [screenshot]
  4. The server does not reuse slots once they finish their tasks until every available slot has been used at least once. Thus Slot 0 will only be reused once all of the other slots have run once. This seems to have the consequence that if the number of slots is small, say 2, and the number of requests larger than 2, the whole thing terminates with only 2 requests processed.

    [screenshot]
  5. Slot 31 and Slots 0-9 start simultaneously once the others have finished, which again suggests that something is imposing a preference for batches of 10 slots, but I can't find it in the code.

    [screenshot]
  6. When unable to find contiguous slots large enough for a batch the batch-size is halved repeatedly until it either finds a block or reaches 1. This is as intended. But during this process the kvcache increments by 1 each time a new batch-size is tried. Is this intended?

  7. The batch-size starts at 512 regardless of the maximum kvcache size, so this start-value could be restricted to the maximum kvcache size to save a little processing time when we need to search for smaller contiguous blocks.

  8. As always I apologise in advance if any of these supposed 'results' are spurious because of failures in my coding, but there does seem to be something amiss somewhere.

pudepiedj commented 6 months ago

Point 4 above arises from the get_slot function, which finds all the available slots and chooses the last. I think it should just find the first one that is free and then break. (There is something odd about the first, untimed test, which never seems to find anything, but that is WIP. I don't think slot.is_available is being [re]set correctly, but I also don't yet quite understand what the json_value is doing that passes a slot.id to the function from around line 1650. Watch this space!) So:

llama_client_slot *get_slot(int id)
    {
        int64_t t_last = ggml_time_us();
        llama_client_slot *last_used = nullptr; // default return if no slot is available; should we look for first used available?

        for (llama_client_slot &slot : slots)
        {
            if (slot.id == id && slot.available())  // this seems only to select slot 5 !!!
            {
                LOG_TEE("\033[4+%d;0HSlot initiated (untimed; marked as available) = %d\n", slot.id, slot.id);
                return &slot;
            }

            if (slot.available() && slot.t_last_used < t_last)  // seems always to select slot 0 and one other ...
            {
                LOG_TEE("\033[4+%d;0H\033[KSlot initiated (timed and finished) = %d\n", slot.id, slot.id);
                last_used = &slot;
                t_last = slot.t_last_used;
                break;  // we only want one and preferably the first one; this works to that effect
            }
        }
        if (last_used != nullptr) {
            LOG_TEE("\033[4+%d;0HSlot %d selected as available\n", last_used->id, last_used->id);
        }
        return last_used;
    }

The logging will be unnecessary eventually, but it now shows that once a slot is released it is immediately reused, so it's useful. Obviously tell me if there are reasons not to reuse slots in this way.

I still haven't found out why the queue_tasks size seems to be limited to 10 despite having limitless capacity; is it something to do with the mutex lock and associated lambda? @ggerganov, what sets the limit on the number of tasks queued initially? If I log the queue size at the start of void process_tasks, it rises to 10 regardless of how many more requests are being sent, reduces to 1, and then never seems to rise above 1 again. Once slots are reused, this effectively means we never have more than 10 slots processing.

[screenshot]

Since there is no apparent limit in the queue_tasks.max_size() value, something must be capping it, such as a limit I can't find on how many tasks are allowed.

pudepiedj commented 6 months ago

A few further observations.

  1. The note from @ggerganov (aka 'GG') almost at the end of server.cpp about it crashing if threads are put in main(): I had the same problem when just using make -j but if I run cmake --build . --config Debug and then run in debug mode in VSC with a suitable launch.json, it doesn't crash and the debug process more or less works.
  2. Running with a breakpoint in main() at svr.Post("/completion", ...) exposes all the parameters. Some of them must control why the later call to llama.request_completion() first processes one request (task 0), then processes 10 all at once (tasks 1-10), then returns to processing only one at a time as earlier slots are released and reused, regardless of how many slots are available. So something controls this and prevents more than 10 requests from being queued. I suspect parameters in httplib.h. WIP.
  3. llama_batch_init() is called with n_ctx; should it be n_ctx_slot? Otherwise it tries to allocate params.n_parallel copies of n_ctx if I understand it rightly, and if I change it to n_ctx_slot it runs perfectly well and allows me to use -c 16384 with -np 16.
  4. An interesting experiment is to make n_discard something random rather than always half of n_left, using n_discard = rand() % (n_left/2).
  5. Still finding a problem with one kvcache reaching, say, 1024/1024 and jamming. Not convinced this is working correctly.
  6. mixtral-8x7b-v0.1.Q2_K.gguf and mixtral-8x7b-v0.1.Q3_K_M.gguf are good to test the server but their responses are almost completely useless. Is this a model or a server problem?

pudepiedj commented 6 months ago

@vashat @ggerganov I think the limit is set inside httplib.h by

#ifndef CPPHTTPLIB_THREAD_POOL_COUNT
#define CPPHTTPLIB_THREAD_POOL_COUNT                                           \
  ((std::max)(8u, std::thread::hardware_concurrency() > 0                      \
                      ? std::thread::hardware_concurrency() - 1                \
                      : 0))
#endif

I tried setting 8u to 32u and it suddenly allowed -np 32. It actually only allows 31 because the -1 reserves a thread for the system. Take a look at yours and see what value is there where I have 8u.
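
Because the default is wrapped in #ifndef, it can also be overridden without editing httplib.h at all, either by passing -DCPPHTTPLIB_THREAD_POOL_COUNT=... to the compiler or by defining the macro before the header is first included. A minimal sketch (32 is just an example value):

// in server.cpp, before the first #include of httplib.h
#define CPPHTTPLIB_THREAD_POOL_COUNT 32   // needs to be at least the -np value you want served concurrently
#include "httplib.h"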

[screenshot]

I will put up a PR with the graphics code later in case it would be of interest.

ggerganov commented 6 months ago

The graphics look interesting :) But I'll need some more context about what we are looking at

Regarding the httplib.h threads - that's a strange limitation. Are we somehow misusing the library? Ideally, we don't want to have a limit on the number of parallel requests being processed.

pudepiedj commented 6 months ago

This all started from @vashat's observation that he couldn't get more than 6 slots to run at once. I had the same issue, but the limit was 10. So even when sending 40 requests.post calls from Python it would never queue more than 11 (task 0, which it ran on its own, then tasks 1-10, which it queued and then ran all at once as soon as task 0 finished; even this isn't right, but that behaviour persists). I couldn't find what was causing this but eventually tracked it down to the ThreadPool as shown. Altering that changes the behaviour completely: now even if I only have -np 16 it will queue 32, or whatever number I replace the 8u with; it seems to have no system implications so far. You would know far better than me. In the absence of a functional debugging option, which I can't make work, running my code with -skvi will expose this behaviour with some LOG_TEE lines.

Edit: @ggerganov Thinking about it overnight, and at the risk of stating the obvious: if the ggml-metal backend can cope with a virtually unlimited number of parallel processes given sufficient resources such as GPUs and VRAM (is there a theoretical limit?), then we certainly don't want a constraint arising from httplib.h, which is really designed for CPUs and derives hardware_concurrency from the CPU, to limit the number of simultaneous slots we can run. There may, of course, be some implications of CPU cores for setting up the GPUs, but that's a separate issue. What I think we should do temporarily is set the upper limit in the CPPHTTPLIB_THREAD_POOL_COUNT macro high, and later add a prior definition that silences the #ifndef default, with the value set by your ggml-metal code. Otherwise, as you rightly say and @vashat observed, this value will permanently limit how large -np N can effectively be, even on machines with huge resources. Does that sound reasonable?

I have now tried with 128u (a maximum ThreadPool of 128 threads - without the graphics!), sent it 256 sequential but uncontrolled requests.post calls, and this is what happens ...

Some time later ... It ran the 256 requests on -np 64 with -c 16384 and a maximum thread count of 128, but maxed out everything. (It is also worth noting that the responses were worthless nonsense, but that's another story.)

From the llama.log after numerous diagnostic LOG_TEE additions:

[screenshot]

There is a task queue which is the 'Waiting list', 64 slots that can process, and a deferred queue where tasks are stored that request slots when there are none available. So here task 69 gets a slot and starts processing but task 70 is denied and is deferred.

[screenshot]

Later slots 42 (no, I really didn't make that up) and 62 are released, and so the queued task 70 can be accommodated along with task 71:

[screenshot]

And the deferred task queue goes on accepting rejected tasks while the threads queue eventually fills up at 128 tasks:

[screenshot]

The log then shows that something goes wrong with the deferred task queue, because shortly afterwards we see that it has dropped back to 0. I am investigating this (WIP), unless I simply haven't understood its operation properly:

[screenshot]

The log suggests that release and/or whatever triggers the logging of tokens somehow zeroes the deferred task queue. This just shows other operations and that the last of 256 requests reached the port:

[screenshot]

Very long log! Might be worth an option to turn off logging of tokens to shorten it.

A few other observations. I think the logic of get_slot can be improved, because the call-by-task-id path never seems to be used; -1 is by far the most common task_id.

I also think there may be a bug in how the task_id parameter id gets incremented. At the moment we are incrementing it inside get_new_id in utils.hpp and also incrementing it when we assign an id to a free task in post() where we put task_id = id++; but this potentially increments the id twice; should the latter just be task_id = id;?
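
As a trivial standalone illustration of the concern (the names here are invented for the example, not the actual server/utils.hpp code):

#include <cstdio>

static int id = 0;

int get_new_id() { return id++; }   // the generator already post-increments

int main() {
    int first  = get_new_id();      // first  = 0, id -> 1
    int second = id++;              // second = 1, id -> 2 (a second, possibly redundant bump)
    std::printf("first=%d second=%d id=%d\n", first, second, id);
    // if both increments happen for one logical task, id values are skipped
    return 0;
}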

Can we correct the spelling of formated to formatted in the four get_formated_generation calls?

Sorry if this is stupid, but when we call llama_batch_init(n_ctx, 0, params.n_parallel) don't we create n_parallel batches of size n_ctx when it should be of size n_ctx_slot? If I do that, as in the long run above, it still seems to run and allows a much larger kvcache within given resources. Maybe I am just being blind as usual, and maybe this explains the nonsensical answers.

kvgraphics: each row of the graphic is a slot (the picture is from an earlier, smaller run). It adjusts automatically to -np N. The halfway point is the context-shift limit. As it stands it shows how each kvcache fills, context-shifts and eventually terminates. The numbers show the token occupancy out of the total n_ctx_slot size; the symbols indicate IDLE and PROCESSING; the second column is slot.command; the last shows a cross if the kvcache fills. Sometimes it fills and stalls and takes a very long time to recover. I think this is to do with how the size of n_batch is restored to its original value after we reduce it incrementally to find a contiguous slot in the cache, which doesn't quite seem to work in all cases where it reduces to 1. All of this graphics output is selected with command-line flags: -skvg just shows the graphics; -skvi adds interaction that pauses execution while the screen messages are read. The default is OFF; it is purely diagnostic. You can see the code on my repo at pudepiedj/llama. Everything is in server.cpp except the id++ tweak to utils.hpp. It is definitely not intended for merging; there are lots of extra LOG_TEE lines all over. I am just about to push the latest version. The graphics function is kvgraphics, around line 322.

When's the 10 million token version being merged? 🥇

pudepiedj commented 6 months ago

This little video shows the server running 60 simultaneous POST /completion requests with the model gemma-2b-it.gguf, sent from another laptop on the local network with API-key validation. I have suppressed the LOG_INFO printouts from the latest commits because they disrupt the display. Invoked with -skvg to display the graphic (default off); -skvi allows pausing between requests but is not necessary. I can run as large a context as -c 32768 on my machine because I use llama_batch_init(n_ctx_slot, ...) instead of llama_batch_init(n_ctx, ...), which seems to work. Should it?

% make -j && ./bin/server -m ../models/gemma-2b-it.gguf -ngl 99 -c 32768 -np 64 -n 1024 --api-key-file ../apikeys.txt --host "0.0.0.0" -cb -skvg

https://github.com/ggerganov/llama.cpp/assets/26062177/4e1a3601-c027-459e-91f1-282af565fdf5

phymbert commented 6 months ago

@pudepiedj What is the issue with LOG_INFO? You can now switch to text format with --log-format text.

pudepiedj commented 6 months ago

> @pudepiedj What is the issue with LOG_INFO? You can now switch to text format with --log-format text.

There is no issue: it's nicely implemented, although my preference would be for all logging to go to date-stamped files rather than to the display. I suppressed it only because it interferes with the display output. If there were any interest in incorporating this kind of display in server.cpp - is there? - the best option would be to switch off or redirect the logging whenever the -skvg or -skvi options are chosen, but I didn't write it for any purpose other than to understand the server mechanism better. Incidentally, I notice that in the server_log() function we are using std::cout, but I was told by @ggerganov that this is not permitted in this project. Has that changed?

phymbert commented 6 months ago

> > @pudepiedj What is the issue with LOG_INFO? You can now switch to text format with --log-format text.
>
> There is no issue: it's nicely implemented, although my preference would be for all logging to go to date-stamped files rather than to the display. I suppressed it only because it interferes with the display output. If there were any interest in incorporating this kind of display in server.cpp - is there? - the best option would be to switch off or redirect the logging whenever the -skvg or -skvi options are chosen, but I didn't write it for any purpose other than to understand the server mechanism better. Incidentally, I notice that in the server_log() function we are using std::cout, but I was told by @ggerganov that this is not permitted in this project. Has that changed?

No, we actually need to continue the effort on refactoring the server logging... I have included your good idea about completely removing logs; --log-disable is misleading here.

pudepiedj commented 6 months ago

Thanks. Just for information: I have partly/mostly implemented an enhanced api_key process that allows a JSON file with records of the form {username: [apikey, usercodename]}; it still works with just the apikey but could be extended. If the logging is a priority, I can change the parsing of the args so that log output goes to a file when the graphics are enabled, but it will involve passing at least a filename and an extra boolean to server_log, I think. It probably also needs some consultation and agreement about what we log (or don't log), when, and how much. We should also enable --log_file FILENAME on server.cpp so we can specify an alternative to the default (or make the default take a datetime qualifier, llama.$(date ...).log, so it isn't constantly overwritten). I also think an option to turn off logging the tokenisation would help to reduce the inordinate length of some log files and make them more useful to the general user.
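
For the date-stamped default, something along these lines would do (an illustrative helper, not existing server code):

#include <cstdio>
#include <ctime>
#include <string>

// build a default log name such as "llama.20240204-153000.log"
static std::string default_log_name() {
    const std::time_t t = std::time(nullptr);
    char buf[64];
    std::strftime(buf, sizeof(buf), "llama.%Y%m%d-%H%M%S.log", std::localtime(&t));
    return std::string(buf);
}

int main() {
    std::printf("%s\n", default_log_name().c_str());
    return 0;
}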

compilade commented 6 months ago

> Sorry if this is stupid, but when we call llama_batch_init(n_ctx, 0, params.n_parallel) don't we create n_parallel batches of size n_ctx when it should be of size n_ctx_slot? If I do that, as in the long run above, it still seems to run and allows a much larger kvcache within given resources. Maybe I am just being blind as usual, and maybe this explains the nonsensical answers.

@pudepiedj No, this is not what happens (at least, from my understanding). It doesn't create n_parallel batches. The third parameter of llama_batch_init only changes how many seq_id values (sequences) a token in the batch can be part of. In fact, it could probably even be 1 in the server example, because it only ever assigns a single sequence per token; for an n_ctx of 32768 and -np 64 that would save around 8 MiB. (Meanwhile, the total size of the llama_batch in this case is around 8.5 MiB; interestingly, storing the list of seq_id per token takes most of the space in that buffer.)
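
As a back-of-the-envelope check of those figures - assuming llama_batch stores, per token, a 4-byte id, a 4-byte position, a 4-byte seq_id count, an 8-byte seq_id pointer plus n_seq_max 4-byte seq_ids, and a 1-byte logits flag (a rough model of the allocation, not the exact code):

#include <cstdio>

int main() {
    const double MiB       = 1024.0 * 1024.0;
    const long   n_tokens  = 32768;   // -c 32768
    const long   n_seq_max = 64;      // -np 64

    const long per_token_fixed = 4 + 4 + 4 + 8 + 1;          // token, pos, n_seq_id, seq_id pointer, logits flag
    const long seq_id_lists    = n_tokens * n_seq_max * 4;   // the per-token seq_id arrays

    std::printf("seq_id lists: %.1f MiB\n", seq_id_lists / MiB);
    std::printf("whole batch : %.1f MiB\n", (n_tokens * per_token_fixed + seq_id_lists) / MiB);
    return 0;
}

That gives roughly 8 MiB for the seq_id lists and a little under 9 MiB overall, in the same ballpark as the figures above, and it shows why the seq_id lists dominate the buffer.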

> I can run as large a context as -c 32768 on my machine because I use llama_batch_init(n_ctx_slot, ...) instead of llama_batch_init(n_ctx, ...), which seems to work. Should it?

Probably not. Regarding the first parameter, I think it needs to be n_ctx, not n_ctx_slot, because new tokens from all of the slots are added to the same batch for parallel processing (note that this big batch is actually split into smaller batches of at most n_batch tokens when llama_decode is run, so unless there's something I don't understand going on, it should not really affect your VRAM usage). When using n_ctx_slot for this, you risk a buffer overflow when the total number of new tokens processed in parallel is greater than n_ctx_slot, even if no slot individually uses more than n_ctx_slot tokens. With -c 32768 -np 64 and n_ctx_slot passed to llama_batch_init, you are assuming that there will never be more than 512 tokens sent from all 64 clients combined at a time; each client would only have to make a simultaneous request of more than 8 tokens to overflow the buffer. It seems to have worked so far only because the clients' requests are not all processed simultaneously at the beginning (I'm not sure why, but only one slot seems to be assigned at a time; in that case the maximum size of the batch could even be n_ctx_slot + n_parallel, but if more than one client can get a slot at a time then using n_ctx is safer, and it doesn't really use much more memory in absolute terms). Hope this helps.

ggerganov commented 6 months ago

> The third parameter of llama_batch_init only changes how many seq_id values (sequences) a token in the batch can be part of. In fact, it could probably even be 1 in the server example, because it only ever assigns a single sequence per token; for an n_ctx of 32768 and -np 64 that would save around 8 MiB. (Meanwhile, the total size of the llama_batch in this case is around 8.5 MiB; interestingly, storing the list of seq_id per token takes most of the space in that buffer.)

Yes, this is correct.

Regarding the first argument of llama_batch_init(): think of it as the maximum number of tokens that you are ever going to submit to llama_decode() in a single call. In server (and all other examples) we have logic to chunk very large batches into smaller batches of at most n_batch tokens.

There is also work by @slaren that I think will soon automatically perform this chunking within the llama library, which should simplify the logic in the user code.
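
In outline, that chunking - together with the halve-and-retry behaviour noted earlier when llama_decode() cannot find a free KV-cache block - looks roughly like the following. This is a simplified sketch rather than the exact server.cpp code, and the llama_batch field order is assumed from the current llama.h:

#include "llama.h"
#include <algorithm>
#include <cstdint>

// feed a large batch to llama_decode() in chunks of at most n_batch tokens,
// halving the chunk size and retrying whenever no free KV-cache block is found
static bool decode_in_chunks(llama_context * ctx, llama_batch & batch, int32_t n_batch) {
    for (int32_t i = 0; i < batch.n_tokens; i += n_batch) {
        const int32_t n_tokens = std::min(n_batch, batch.n_tokens - i);

        llama_batch batch_view = {
            n_tokens,
            batch.token    + i,
            nullptr,                 // no embeddings
            batch.pos      + i,
            batch.n_seq_id + i,
            batch.seq_id   + i,
            batch.logits   + i,
            0, 0, 0,                 // all_pos_0 / all_pos_1 / all_seq_id (unused here)
        };

        const int ret = llama_decode(ctx, batch_view);
        if (ret != 0) {
            if (n_batch == 1 || ret < 0) {
                return false;        // even a single-token batch failed: give up
            }
            n_batch /= 2;            // no KV-cache slot found: halve the chunk and retry
            i -= n_batch;            // compensate for the loop increment so the same tokens are retried
            continue;
        }
    }
    return true;
}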

pudepiedj commented 6 months ago

This is incredibly helpful and clarifies something I had misunderstood about the architecture, so I very much appreciate that two such busy people have taken the trouble to explain it. My little video therefore shows the kvcaches filling up as successive batches of parallel chunks are fed to the model. Once a process finishes, that share of the batch is spread amongst the rest so that (unless the slot is reallocated) they all gradually speed up until only one is left because it has now been given the whole batch to itself. The kvcache for each slot contains pointers to the chunks belonging to that process from successive batches that help reference the most recent ones. Making the first argument of llama_batch_init much smaller will make the whole process either much less efficient (because it forces chunks to be tiny) or lead to a system crash because the required sizes will exceed what is allocated (a situation which I guess the code should try to trap as an exception).

mscheong01 commented 5 months ago

I just reviewed this issue today for the first time, and based on @vashat's initial report and demonstration, I believe the browser could be the cause. Web browsers impose a restriction on the maximum number of connections per host, and Chrome's limit is 6, which coincides exactly with the observed limit in this case.

[screenshot]
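
A quick way to confirm this is to drive the requests from a standalone client instead of browser tabs, for example with std::thread and the httplib.h client that ships alongside the server example. A minimal sketch (host, port and payload are illustrative):

#include "httplib.h"
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

int main() {
    const int n_clients = 10;   // deliberately more than Chrome's per-host limit of 6
    const std::string body =
        R"({"prompt": "Write one sentence about llamas.", "n_predict": 32})";

    std::vector<std::thread> workers;
    for (int i = 0; i < n_clients; ++i) {
        workers.emplace_back([i, &body]() {
            httplib::Client cli("localhost", 8080);   // adjust host/port to your server
            cli.set_read_timeout(600, 0);             // completions can take a while
            auto res = cli.Post("/completion", body, "application/json");
            std::printf("client %d: status %d\n", i, res ? res->status : -1);
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return 0;
}

If all ten requests show up as processing slots in the server log at once, the 6-request ceiling was the browser's per-host connection limit rather than anything in the server.
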
github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.