LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with a KoboldAI UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

SSE-streaming endpoint problem #250

Closed Vladonai closed 1 year ago

Vladonai commented 1 year ago

I use the SSE streaming endpoint (/api/extra/generate/stream) in my application. I notice that with every request only a small part of the prompt is processed, not the whole thing, even though the prompt shown in the console is exactly what I send. The smartcontext event never occurs. Is this a bug, or am I using this endpoint incorrectly?

LostRuins commented 1 year ago

If you're using the streaming endpoint, it should handle the entire prompt as you send it. I assume you're using your own custom client? Why do you say only a small part is used?

Vladonai commented 1 year ago

The Processing Prompt message in the console shows only a handful of tokens each time - Processing Prompt (9 / 9 tokens), (8/8 tokens), (12/12 tokens)... - even though the prompt keeps getting bigger and bigger: 200, 300, 500 tokens. The smartcontext event doesn't occur even when the prompt reaches 1000 tokens, and after ~900 tokens the model completely loses the previous context.

LostRuins commented 1 year ago

Is your client sending multiple requests? Can you run koboldcpp with --debugmode and see if your client is the one making multiple requests?

Vladonai commented 1 year ago

Input: {"n": 1, "max_context_length": 2048, "max_length": 500, "rep_pen": 1.19, "temperature": 0.79, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.95, "rep_pen_range": 2048, "rep_pen_slope": 0.9, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "You: Hello!\nJohn:", "quiet": true, "stop_sequence": ["You:", "\n"]} 127.0.0.1 - - [19/Jun/2023 22:08:02] "POST /api/extra/generate/stream HTTP/1.1" 200 -

[Debug: Dump Input Tokens, format: 5] ' (1)', ' You (887)', ': (29901)', ' Hello (15043)', '! (29991)', '\n (13)', 'John (11639)', ': (29901)',

[Debug: Context Size = 0]

Processing Prompt (8 / 8 tokens) Generating (1 / 500 tokens) [( Hi 78.71%)] Generating (2 / 500 tokens) [( there 60.79%)] Generating (3 / 500 tokens) [(. 52.39%)] Generating (4 / 500 tokens) [( How 53.87%)] Generating (5 / 500 tokens) [( can 80.18%)] Generating (6 / 500 tokens) [( I 96.25%)] Generating (7 / 500 tokens) [( help 73.53%)] Generating (8 / 500 tokens) [( you 96.40%)] Generating (9 / 500 tokens) [(? 72.93%)] Generating (10 / 500 tokens) [(\n 46.38%)]

(Stop sequence triggered: <\n>) Time Taken - Processing:1.4s (171ms/T), Generation:2.4s (242ms/T), Total:3.8s (2.6T/s) Output: Hi there. How can I help you?

127.0.0.1 - - [19/Jun/2023 22:08:06] "POST /api/extra/generate/stream HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 2048, "max_length": 500, "rep_pen": 1.19, "temperature": 0.79, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.95, "rep_pen_range": 2048, "rep_pen_slope": 0.9, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "You: Hello!\nJohn: Hi there. How can I help you?\nYou: How are you?\nJohn:", "quiet": true, "stop_sequence": ["You:", "\n"]} 127.0.0.1 - - [19/Jun/2023 22:11:39] "POST /api/extra/generate/stream HTTP/1.1" 200 -

[Debug: Dump Input Tokens, format: 5] 'You (3492)', ': (29901)', ' How (1128)', ' are (526)', ' you (366)', '? (29973)', '\n (13)', 'John (11639)', ': (29901)',

[Debug: Context Size = 18] ' (1)', ' You (887)', ': (29901)', ' Hello (15043)', '! (29991)', '\n (13)', 'John (11639)', ': (29901)', ' Hi (6324)', ' there (727)', '. (29889)', ' How (1128)', ' can (508)', ' I (306)', ' help (1371)', ' you (366)', '? (29973)', '\n (13)',

Processing Prompt (9 / 9 tokens) Generating (1 / 500 tokens) [( I 51.47%)] Generating (2 / 500 tokens) [(' 76.82%)] Generating (3 / 500 tokens) [(m 99.98%)] Generating (4 / 500 tokens) [( doing 69.46%)] Generating (5 / 500 tokens) [( well 57.87%)] Generating (6 / 500 tokens) [(, 87.53%)] Generating (7 / 500 tokens) [( thanks 78.37%)] Generating (8 / 500 tokens) [( for 95.65%)] Generating (9 / 500 tokens) [( asking 99.98%)] Generating (10 / 500 tokens) [(. 97.69%)] Generating (11 / 500 tokens) [(How 44.11%)] Generating (12 / 500 tokens) [( are 89.04%)] Generating (13 / 500 tokens) [( you 92.23%)] Generating (14 / 500 tokens) [(? 87.52%)] Generating (15 / 500 tokens) [(\n 80.88%)]

(Stop sequence triggered: <\n>) Time Taken - Processing:1.6s (174ms/T), Generation:3.7s (250ms/T), Total:5.3s (2.8T/s) Output: I'm doing well, thanks for asking.How are you?

127.0.0.1 - - [19/Jun/2023 22:11:44] "POST /api/extra/generate/stream HTTP/1.1" 200 -

LostRuins commented 1 year ago

Okay, so that sounds like an issue with your client. What UI are you using? It seems to be spamming the API with individual requests. As you can see, each successively longer prompt being processed corresponds to a new request sent by the UI.

LostRuins commented 1 year ago

Also as you can see, each request is generating completely different tokens/responses.

Vladonai commented 1 year ago

For each new request:

var parameters = new
{
    // Generation parameters (prompt, samplers, stop_sequence, etc.)
};
var httpClient = new HttpClient
{
    Timeout = Timeout.InfiniteTimeSpan
};
var content = new StringContent(JsonConvert.SerializeObject(parameters), Encoding.UTF8, "application/json");
var request = new HttpRequestMessage(HttpMethod.Post, "http://localhost:5001/api/extra/generate/stream")
{
    Content = content
};
request.Headers.Accept.Add(new MediaTypeWithQualityHeaderValue("application/json"));
try
{
    // Read the headers first so the body can be consumed as it streams in.
    using (var response = await httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead).ConfigureAwait(false))
    using (var stream = await response.Content.ReadAsStreamAsync())
    using (var reader = new StreamReader(stream))
    {
        while (!reader.EndOfStream)
        {
            var line = await reader.ReadLineAsync();
            // Each SSE event arrives as a line of the form: data: {"token": "..."}
            if (line != null && line.StartsWith("data:"))
            {
                var dataJson = line.Substring("data:".Length).Trim();
                var tokenObject = JsonConvert.DeserializeObject<dynamic>(dataJson);
                var token = tokenObject.token;
                myPrompt += token;
                // Processing...
            }
            await Task.Delay(50);
        }
    }
}
finally
{
    request.Dispose();
    httpClient.Dispose();
}

What am I doing wrong?

LostRuins commented 1 year ago

Have you tried using the sync endpoint? Is it working for you?

Vladonai commented 1 year ago

I tried the sync endpoint (http://localhost:5001/api/v1/generate), using it as a pseudo-stream as I understand it (see the sketch after these logs). Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.19, "temperature": 0.79, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.95, "rep_pen_range": 1024, "rep_pen_slope": 0.9, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "You: Hello!\nJohn:", "quiet": true, "stop_sequence": ["You:", "\n"]}

[Debug: Dump Input Tokens, format: 5] ' (1)', ' You (887)', ': (29901)', ' Hello (15043)', '! (29991)', '\n (13)', 'John (11639)', ': (29901)',

[Debug: Context Size = 0]

Processing Prompt (8 / 8 tokens) Generating (1 / 8 tokens) [( Hi 78.71%)] Generating (2 / 8 tokens) [( there 60.79%)] Generating (3 / 8 tokens) [(. 52.39%)] Generating (4 / 8 tokens) [( How 53.87%)] Generating (5 / 8 tokens) [( can 80.18%)] Generating (6 / 8 tokens) [( I 96.25%)] Generating (7 / 8 tokens) [( help 73.53%)] Generating (8 / 8 tokens) [( you 96.40%)]

Time Taken - Processing:1.2s (152ms/T), Generation:1.8s (227ms/T), Total:3.0s (2.6T/s) Output: Hi there. How can I help you 127.0.0.1 - - [19/Jun/2023 22:42:34] "POST /api/v1/generate HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.19, "temperature": 0.79, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.95, "rep_pen_range": 1024, "rep_pen_slope": 0.9, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "You: Hello!\nJohn: Hi there. How can I help you", "quiet": true, "stop_sequence": ["You:", "\n"]}

[Debug: Dump Input Tokens, format: 5] ' you (366)',

[Debug: Context Size = 15] ' (1)', ' You (887)', ': (29901)', ' Hello (15043)', '! (29991)', '\n (13)', 'John (11639)', ': (29901)', ' Hi (6324)', ' there (727)', '. (29889)', ' How (1128)', ' can (508)', ' I (306)', ' help (1371)',

Processing Prompt (1 / 1 tokens) Generating (1 / 8 tokens) [(? 72.93%)] Generating (2 / 8 tokens) [(\n 46.38%)]

(Stop sequence triggered: <\n>) Time Taken - Processing:0.3s (273ms/T), Generation:0.3s (135ms/T), Total:0.5s (3.7T/s) Output: ?

127.0.0.1 - - [19/Jun/2023 22:42:34] "POST /api/v1/generate HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.19, "temperature": 0.79, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.95, "rep_pen_range": 1024, "rep_pen_slope": 0.9, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "You: Hello!\nJohn: Hi there. How can I help you?\nYou: How are you?\nJohn:", "quiet": true, "stop_sequence": ["You:", "\n"]}

[Debug: Dump Input Tokens, format: 5] 'You (3492)', ': (29901)', ' How (1128)', ' are (526)', ' you (366)', '? (29973)', '\n (13)', 'John (11639)', ': (29901)',

[Debug: Context Size = 18] ' (1)', ' You (887)', ': (29901)', ' Hello (15043)', '! (29991)', '\n (13)', 'John (11639)', ': (29901)', ' Hi (6324)', ' there (727)', '. (29889)', ' How (1128)', ' can (508)', ' I (306)', ' help (1371)', ' you (366)', '? (29973)', '\n (13)',

Processing Prompt (9 / 9 tokens) Generating (1 / 8 tokens) [( I 51.47%)] Generating (2 / 8 tokens) [(' 76.82%)] Generating (3 / 8 tokens) [(m 99.98%)] Generating (4 / 8 tokens) [( doing 69.46%)] Generating (5 / 8 tokens) [( well 57.87%)] Generating (6 / 8 tokens) [(, 87.53%)] Generating (7 / 8 tokens) [( thanks 78.37%)] Generating (8 / 8 tokens) [( for 95.65%)]

Time Taken - Processing:1.5s (163ms/T), Generation:1.8s (229ms/T), Total:3.3s (2.4T/s) Output: I'm doing well, thanks for 127.0.0.1 - - [19/Jun/2023 22:42:59] "POST /api/v1/generate HTTP/1.1" 200 -

Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.19, "temperature": 0.79, "top_p": 0.9, "top_k": 0, "top_a": 0, "typical": 1, "tfs": 0.95, "rep_pen_range": 1024, "rep_pen_slope": 0.9, "sampler_order": [6, 0, 1, 2, 3, 4, 5], "prompt": "You: Hello!\nJohn: Hi there. How can I help you?\nYou: How are you?\nJohn: I'm doing well, thanks for", "quiet": true, "stop_sequence": ["You:", "\n"]}

[Debug: Dump Input Tokens, format: 5] ' for (363)',

[Debug: Context Size = 34] ' (1)', ' You (887)', ': (29901)', ' Hello (15043)', '! (29991)', '\n (13)', 'John (11639)', ': (29901)', ' Hi (6324)', ' there (727)', '. (29889)', ' How (1128)', ' can (508)', ' I (306)', ' help (1371)', ' you (366)', '? (29973)', '\n (13)', 'You (3492)', ': (29901)', ' How (1128)', ' are (526)', ' you (366)', '? (29973)', '\n (13)', 'John (11639)', ': (29901)', ' I (306)', '' (29915)', 'm (29885)', ' doing (2599)', ' well (1532)', ', (29892)', ' thanks (3969)',

Processing Prompt (1 / 1 tokens) Generating (1 / 8 tokens) [( asking 99.98%)] Generating (2 / 8 tokens) [(. 97.69%)] Generating (3 / 8 tokens) [(How 44.11%)] Generating (4 / 8 tokens) [( are 89.04%)] Generating (5 / 8 tokens) [( you 92.23%)] Generating (6 / 8 tokens) [(? 87.52%)] Generating (7 / 8 tokens) [(\n 80.88%)]

(Stop sequence triggered: <\n>) Time Taken - Processing:0.3s (283ms/T), Generation:1.6s (224ms/T), Total:1.8s (3.8T/s) Output: asking.How are you?

127.0.0.1 - - [19/Jun/2023 22:43:01] "POST /api/v1/generate HTTP/1.1" 200 -
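For reference, the pattern in these logs boils down to a simple loop: request a handful of tokens from /api/v1/generate, append them to the prompt, and repeat. A rough sketch of that loop, in the same C# style as the client above (requestChunkAsync is a hypothetical caller-supplied delegate that POSTs the JSON shown in the logs and returns the generated text; it is not part of koboldcpp):

// Pseudo-streaming against the sync endpoint: request a few tokens at a time
// and append each chunk to the growing prompt before asking again.
async Task<string> PseudoStreamAsync(string prompt,
    Func<string, int, Task<string>> requestChunkAsync,
    int chunkTokens = 8, int maxChunks = 64)
{
    var text = prompt;
    for (int i = 0; i < maxChunks; i++)
    {
        // Ask the sync endpoint for a small chunk (max_length = 8 in the logs above).
        string chunk = await requestChunkAsync(text, chunkTokens);
        if (string.IsNullOrEmpty(chunk))
            break;                 // nothing more to generate (e.g. the stop sequence ended the reply)
        text += chunk;             // the next request re-sends the grown prompt
        // ...stream the chunk to the UI here...
    }
    return text;
}

Because each iteration re-sends an ever longer prompt, the server's prompt reuse (visible as the small "Processing Prompt (1 / 1 tokens)" counts above) is what keeps this approach fast.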

LostRuins commented 1 year ago

Looks fine. I guess there's something not quite right in the way your client handles the server events?

Vladonai commented 1 year ago

The logic of my application relies on the whole prompt being processed with each request. But in debugmode I see that the last request is separated from the rest of the context, and only some part of it is then processed. The algorithm for selecting that part is not clear to me.

LostRuins commented 1 year ago

No, all requests will be coming from your client. If you see two parts processed, it means your client sent two requests. The server will never start a new request on its own. You should print some logs from your client when you send a new request and analyze why.

Vladonai commented 1 year ago

[Debug: Dump Input Tokens, format: 5] ' (1)', ' You (887)', ': (29901)', ' Hello (15043)', '! (29991)', '\n (13)', 'John (11639)', ': (29901)',

[Debug: Context Size = 0]

Next request: [Debug: Dump Input Tokens, format: 5] 'You (3492)', ': (29901)', ' How (1128)', ' are (526)', ' you (366)', '? (29973)', '\n (13)', 'John (11639)', ': (29901)',

[Debug: Context Size = 18] ' (1)', ' You (887)', ': (29901)', ' Hello (15043)', '! (29991)', '\n (13)', 'John (11639)', ': (29901)', ' Hi (6324)', ' there (727)', '. (29889)', ' How (1128)', ' can (508)', ' I (306)', ' help (1371)', ' you (366)', '? (29973)', '\n (13)',

How does your program separate the Input Tokens from the Context Size? Presumably by saving the previous prompt and comparing it with the new one. Is that how it's done?

LostRuins commented 1 year ago

It's just subtracted. The input tokens have a fast-forwarding algorithm that skips duplicate text, but that is not relevant to the API.
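For illustration, the behaviour visible in the debug dumps amounts to this: the new prompt shares a prefix with the stored context, and only the tokens after that prefix need processing. A minimal sketch of the idea (this is not koboldcpp's actual code; SharedPrefixLength and the token lists are hypothetical):

// Length of the longest shared token prefix between the stored context and the new prompt.
// Only the tokens after this prefix show up as "Processing Prompt (n / n tokens)".
static int SharedPrefixLength(IReadOnlyList<int> previousContext, IReadOnlyList<int> newPrompt)
{
    int limit = Math.Min(previousContext.Count, newPrompt.Count);
    int i = 0;
    while (i < limit && previousContext[i] == newPrompt[i])
        i++;
    return i;
}

// In the dump above, all 18 tokens of stored context are reused,
// so only the 9 newly appended prompt tokens are processed:
// int toProcess = newPromptTokens.Count - SharedPrefixLength(contextTokens, newPromptTokens);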

Vladonai commented 1 year ago

I've found the problem. Here is an example of a conversation. Until the context reaches 1548 tokens, everything works as expected, but then everything falls apart: the subtraction no longer works and the model "forgets" everything from the previous conversation. Compare the file at the beginning and at the end. https://mega.nz/file/fBdxXI6D#E0vIuImpNL1jI_0lc0wXmZsaFdUQki7RH6ODBL7mM_w

LostRuins commented 1 year ago

What's wrong? The log looks fine to me.

Vladonai commented 1 year ago

What happens to the algorithm when the context reaches 1548 tokens?

LostRuins commented 1 year ago

It's at the maximum context size, and the earlier text is being trimmed. You can increase the token limit by setting a longer context size.

LostRuins commented 1 year ago

Remember, you are requesting 500 tokens. That amount is deducted from the total context size.

LostRuins commented 1 year ago

1548 + 500 is 2048
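In other words, the prompt may only occupy whatever the requested max_length leaves free: with max_context_length 2048 and max_length 500, at most 1548 prompt tokens survive, and anything older is trimmed. A minimal client-side sketch of that budgeting, assuming a caller-supplied countTokens function (the helper below is hypothetical, not part of koboldcpp):

// Keep only as many of the most recent turns as fit into (max_context_length - max_length).
static List<string> TrimToBudget(List<string> turns, Func<string, int> countTokens,
                                 int maxContextLength, int maxLength)
{
    int budget = maxContextLength - maxLength;   // e.g. 2048 - 500 = 1548 tokens for the prompt
    var kept = new List<string>();
    int used = 0;
    // Walk backwards so the newest turns are kept and the oldest are dropped first.
    for (int i = turns.Count - 1; i >= 0; i--)
    {
        int cost = countTokens(turns[i]);
        if (used + cost > budget)
            break;
        kept.Insert(0, turns[i]);
        used += cost;
    }
    return kept;
}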

Vladonai commented 1 year ago

Yes, after I reserved the response tokens, everything worked as expected. It looks like my SSE streaming implementation is fine. The context dropout may still persist, though, where the model doesn't remember what happened two or three replies ago. I'll run some tests and report whether that aspect has changed.

Vladonai commented 1 year ago

Everything seems to be working. At least the results from the SSE streaming endpoint don't differ from the results from /api/v1/generate. The catch, however, is that after the 30B models, the 13B models are hard to go back to: the quality of their results is simply not usable. But that's not a problem with the program :)