LostRuins / lite.koboldai.net

A zero dependency web UI for KoboldAI Horde
GNU Affero General Public License v3.0
78 stars 37 forks

Retry request if the generation cannot be aborted because of BLAS batch #53

Open aleksusklim opened 9 months ago

aleksusklim commented 9 months ago

Use case:

  1. I type my input.
  2. I click Submit
  3. That button changes to a spinner and gets disabled.
  4. The server starts BLAS with my text.
  5. I see my own line added to the history.
  6. Under the spinner the [ABORT] link appears.
  7. I realize I made a typo in my prompt! I want to edit it, but the history is read-only during the generation.
  8. I click [ABORT] to cancel.
  9. The spinner changes to a button titled Generate More
  10. I edit the history to fix my mistake.
  11. I click on Generate More (or, alternatively, I copy the last text back to the input box, hit Back, then fix it and click Submit again)
  12. I see a popup Error Encountered saying something like Server is busy; please try again later.
  13. I click OK
  14. I see a new popup saying Send Abort Command? Attempt to abort existing request?
  15. No matter whether I reply Yes or No, the BLAS step cannot be aborted.
  16. I wait until either the console shows something new, or the CPU load drops in Task Manager – this means the BLAS is over and the generation is finally aborted.
  17. I send my input again; this time it successfully restarts my generation.

This happens to me so often that I came up with this idea:

  1. Lite should check the returned error from the server.
  2. If it is exactly the error about busy generation, do not show the actual error.
  3. Instead, present a popup titled Background generation in progress with text The koboldcpp server is busy generating its response to a previous request. You can try to abort it now. and three buttons:

Buttons Abort and Cancel work like your current Yes and No. But Abort and retry closes the popup and does this:

  1. Aborts the generation as if "Yes" was pressed.
  2. Turns the send button into a spinner again, with the [ABORT] link under it.
  3. In this state, UI should be read-only as if the generation was in progress.
  4. Lite polls the server frequently (at the rate of regular token streaming) to query the generation state.
  5. When the server finishes its work, Lite sends the actual generation request and then acts as normal.
  6. If the user clicks [ABORT] while the real generation isn't started yet, the polling loop cancels, returning the UI to idle state just as this link is doing currently.

Note that for point 5 you need to define what should happen if the server responds that it is busy again: should we abort once more and restart polling, or error out? If this "generally should not be possible", then the error is better (it would indicate that somebody else sent a request just before us).

On the other hand, if the polling is unreliable, you can forcibly send abort+generate each time until the user cancels or the server goes offline. I don't know anything about Horde, I never used it, so I cannot tell how this would affect it; but I assume you either allow aborting requests there (so the auto-repeat won't make things worse), or you don't.
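The proposed Abort and retry flow could be sketched as a small loop. Here `send_request`, `send_abort`, `server_busy` and `cancelled` are hypothetical callables standing in for Lite's real API calls; this is only an illustration of the proposed behavior, not actual Lite code:

```python
import time

def abort_and_retry(send_request, send_abort, server_busy,
                    cancelled=lambda: False,
                    poll_interval=0.5, timeout=60.0):
    """Abort the stuck generation, poll until the server is free,
    then resend the original request (or give up if cancelled)."""
    send_abort()                          # step 1: abort as if "Yes" was pressed
    deadline = time.monotonic() + timeout
    while server_busy():                  # step 4: poll the generation state
        if cancelled():                   # step 6: user clicked [ABORT] meanwhile
            return None
        if time.monotonic() > deadline:
            raise TimeoutError("server is still busy after the abort")
        time.sleep(poll_interval)
    return send_request()                 # step 5: send the actual request
```

The open question from the note above maps to the `server_busy()` check: if the resent request itself comes back "busy", the loop could either restart or surface the error.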

What do you think?

LostRuins commented 9 months ago

The current BLAS batch cannot be aborted once it's started. It will abort once the current batch is done (e.g. 512/2048 will stop at 1024). Have you tried enabling --multiuser? It enables a queue which should help with your scenario.

aleksusklim commented 9 months ago

The current BLAS batch cannot be aborted once it's started.

Yes, I'm aware, and this is the exact reason why I posted this Issue – to be able to retry automatically when the impossible-to-abort batch is done.

It will abort once the current batch is done

Yes, but until then – Lite shows errors rather than scheduling a resend!

Have you tried enabling --multiuser?

Wait, hold on. Now I tried it, and… is it doing exactly what I described!?

Even inside the BLAS, I can abort and regenerate without any visible errors, and the generation continues automatically!

It enables a queue which should help with your scenario.

Whoa, I found a drawback: if I change something at the beginning of the history and send it, then abort and send something again, abort and send, abort and send – then koboldcpp finishes the batch and starts to generate something that was already aborted!

Then it prints Generate: The response could not be sent, maybe connection was terminated? and generates another request – fully, not stopping at the very first token as it does when really aborting. And again, and again, and again (while my Lite still "waits").

I think you should enforce a socket check at two points:

  1. When you pop the next request from the multiuser queue to start processing it – discard the job if the socket is already dead.
  2. When BLAS finishes and you are about to stream (or just send) back the actual generation results – abort the current job if the socket died while the batch was processing.

In both scenarios you should check whether the user's socket is writable, and proceed only if the browser is still actually waiting for the result – not after you spend time generating something just to throw it away.

I suggest doing this only in multiuser mode, since I see a valid use case where somebody sends a large request and disconnects – expecting koboldcpp to honestly generate everything (at least to fill the history cache, and at most still print the text to the console).

Are there other drawbacks to enabling multiuser mode? If it retries automatically after an aborted batch (and if you can make it cancel generations that nobody is waiting for), it seems so good that I don't see a reason not to use it always!

LostRuins commented 9 months ago

There's no way to tell that the client has already abandoned the session without attempting to send data. In fact for synchronous calls there's no indication to the server when a client disconnects except an error when attempting to write to the connection. For SSE streaming this is easier to detect as the connection is actively used.
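This can be seen with a plain socket pair (a standalone illustration, not koboldcpp code): closing one end produces no event on the other, and the failure only surfaces when the server attempts to write.

```python
import socket

srv, cli = socket.socketpair()   # stand-ins for the two ends of one connection
cli.close()                      # the client silently abandons the session

# No callback fires on the server side; the error only shows up
# when the server attempts to deliver its response:
try:
    srv.sendall(b"HTTP/1.0 200 OK\r\n\r\n")
    detected = False
except (BrokenPipeError, ConnectionResetError):
    detected = True

print("dead client detected on write:", detected)
srv.close()
```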

But here, the issue is also with the Abort command - it's ignored for the users currently waiting in queue, as there is no queue to abort. Only the current active user has the ability to abort their connection. So here's a little illustration of the issue.

UserA sends request 1; request 1 starts processing. UserB sends request 2; it is added to the queue. If UserA sends abort 1, the abort is successful and will trigger after the current batch is done (or after the current token is done). But if instead UserB sends abort 2, the abort is ignored, as UserB is not the active user (there is currently no way to remove UserB's request from the queue). If UserB subsequently abandons the connection, the server will only know when it tries to respond, either on completion or on stream. Then the request will be discarded and the next item will begin processing.

In order for non-started requests to be aborted, a queue of "pending aborts" will need to be logged and tracked. This itself can present another series of problems.

I think I can hack in a solution to keep track of the most recent "pending abort" that should work for a situation with 2 users. I really don't want to queue up aborts for more users than that.
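A minimal sketch of such single-slot deferred-abort bookkeeping (hypothetical names and data shapes, not the actual koboldcpp implementation) also shows why only the last of several queued aborts survives:

```python
from collections import deque

requests = deque()      # queued (genkey, payload) jobs waiting behind the active one
pending_abort = None    # only the single most recent deferred abort is remembered

def abort(genkey, active_genkey):
    """Abort the active generation directly, or remember one queued abort."""
    global pending_abort
    if genkey == active_genkey:
        pass  # signal the running generation to stop (elided here)
    else:
        pending_abort = genkey   # overwrites any earlier deferred abort

def pop_next_job():
    """Pop the next queued job, discarding it if its abort is pending."""
    global pending_abort
    while requests:
        genkey, payload = requests.popleft()
        if genkey == pending_abort:
            pending_abort = None
            continue             # the user already cancelled this one
        return genkey, payload
    return None
```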

LostRuins commented 9 months ago

Added to my experimental branch, if you'd like to test. Note that it only works for up to 2 aborts - multiple aborts by queued users will only save the last attempt.

aleksusklim commented 9 months ago

There's no way to tell that the client has already abandoned the session without attempting to send data.

Are you sure? What kind of sockets do you have!?

Even for HTTP, you could just send the first line of the response headers, which is always the same. But anyway, a TCP socket should trigger an event on disconnection – maybe you need to listen for it?

LostRuins commented 9 months ago

It's just the basic Python http.server: you pick a port and start a server that listens on it. You can see the current implementation in koboldcpp.py

It's not raw TCP sockets. Events such as do_POST() are fired on incoming connections, you get a file-like I/O handle and write your response to it.

aleksusklim commented 9 months ago

Python's http.server exposes the client socket object inside do_POST as self.connection. Since it is an HTTP/1.0 server, serving each request on a separate TCP connection, we can test for a broken pipe by trying to read from the same socket.

There would be no additional data, and .recv will block unless we set a zero timeout on the socket. So the check comes down to calling .setblocking(False) and then trying to .recv(), which returns an empty bytes object for CLOSED sockets but throws a BlockingIOError for live sockets that have no data in the current buffer (they want to wait for it, but we forbade blocking the call). If for some reason the socket still had some incoming bytes (a buggy browser that sent something after the POST body), recv would return non-empty data, which we won't use anyway – so no harm in swallowing it.

Recapping, you can start your handler as

    def do_POST(self):
        connection = self.connection

Then the variable connection represents the state of this client. To check it, you can use a helper function like this:

def is_connection_alive(connection):
    connection.setblocking(False)        # forbid recv from waiting for data
    try:
        # b'' (length 0) means the peer closed; stray bytes mean it is alive
        return len(connection.recv(64)) > 0
    except BlockingIOError:
        return True                      # no data yet, but the peer is still connected

I tried wrapping your line gen = asyncio.run(self.handle_request(genparams, api_format, sse_stream_flag)) as

                print('ALIVE:',is_connection_alive(connection))
                gen = asyncio.run(self.handle_request(genparams, api_format, sse_stream_flag))
                print('ALIVE:',is_connection_alive(connection))

– And then in the console I can clearly see whether the browser had already disconnected or not! (The first check always returns True since it executes right when the handler fires; the second check shows the state after the complete generation.)

So I suppose you should pass the correct connection socket (queuing it as needed) to your other functions up to the point where you know that BLAS has ended – and skip the "generation" phase if the socket turned out to be already dead.

LostRuins commented 9 months ago

This is not an ideal solution.

  1. You have no idea what kind of data is awaiting in the incoming buffer. You also don't know what communication is currently ongoing between the server and client. Calling recv() will consume arbitrary data from the buffer, potentially corrupting any incoming data in unpredictable ways. And yes, peek functions exist, but they are very platform-dependent and non-portable.
  2. You're overriding the blocking state of the socket. Even if you save and restore the blocking state to its original value (which isn't done right now), momentarily modifying the properties of a socket that may already be in use is very bad practice. For example, some other thread which may be using the socket and blocking on it could suddenly find itself unblocked but with incomplete data in the response.

In any case, the solution to the "Queued Abort" has already been added - so if that solves this issue there is no need to go messing with the TCP connection states. Unless there is a proper READ ONLY way of determining the connection status?

aleksusklim commented 9 months ago

You have no idea what kind of data is awaiting in the incoming buffer.

I certainly know: there is no more data, ever. All well-behaved user agents send nothing after the POST body, whose size is determined by the Content-Length header, which you have already parsed successfully before reading the full body data (the lines content_length = int(self.headers['content-length']) and body = self.rfile.read(content_length)). If the client erroneously sends something else, it will never be read by your code and WILL get dropped anyway.

You are serving HTTP/1.0 responses, which by protocol must close the TCP connection after sending a single response to a single request. (See here: https://stackoverflow.com/questions/50120102/python-http-server-keep-connection-alive)
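Assuming koboldcpp keeps the stdlib handler defaults, this is easy to confirm: BaseHTTPRequestHandler speaks HTTP/1.0 unless protocol_version is explicitly raised, and HTTP/1.0 responses close the connection.

```python
from http.server import BaseHTTPRequestHandler

# The stdlib handler defaults to HTTP/1.0; persistent connections
# only become possible after setting protocol_version = "HTTP/1.1".
print(BaseHTTPRequestHandler.protocol_version)   # HTTP/1.0
```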

You're overriding the blocking state of the socket.

Okay, this will suffice, as per https://stackoverflow.com/a/34371270: save old_timeout = socket.gettimeout() first, and restore it with socket.settimeout(old_timeout) afterwards, instead of using setblocking.

Unless there is a proper READ ONLY way of determining the connection status?

Well, you could say "What if I used HTTP/1.1 with Keep-Alive?" and you would be right… Here is the read-only solution, as per https://stackoverflow.com/a/45948462:

        def is_connection_alive(connection):
            import select
            r,w,e = select.select([connection],[connection],[],0)
            return (len(r)==0) and (len(w)==1)

("select" is a core module: https://docs.python.org/3/library/select.html#select.select)

Note that len(w)==1 is always true; and len(r)==0 means "no new data has arrived so far" OR "it is closed and nothing will ever arrive anymore" – and to distinguish these two cases you would have to call .recv, which you don't want to…

Here, if the client sends anything after the POST body without waiting for a response, the connection would be wrongly identified as broken (since the socket becomes immediately readable). But even for HTTP/1.1: can a user agent send two requests without waiting for the first answer? It can: https://stackoverflow.com/questions/21696733/issuing-multiple-requests-before-getting-response But this is not the case for POST: "Non-idempotent requests such as POST should not be pipelined." (https://en.wikipedia.org/wiki/HTTP_pipelining#Motivation_and_limitations)

I think you are safe to use the above implementation (just move the "import" to the top of your source code); I've tested it on Windows with all three modes of token streaming.
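For completeness, the two "readable" states could be told apart without consuming data by combining select with a MSG_PEEK recv (the platform-dependent peek mentioned earlier in the thread). A sketch, not a drop-in for koboldcpp:

```python
import select
import socket

def connection_state(connection):
    """Return 'alive' (quiet but connected), 'data' (bytes pending),
    or 'closed', without consuming anything from the buffer."""
    readable, _, _ = select.select([connection], [], [], 0)
    if not readable:
        return "alive"       # nothing to read: the peer is simply quiet
    # Readable means either pending data or EOF; MSG_PEEK looks at
    # the first byte without removing it from the buffer:
    return "data" if connection.recv(1, socket.MSG_PEEK) else "closed"
```

With this, any bytes a misbehaving client sent after the POST body would be classified as "data" rather than as a broken connection.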

LostRuins commented 9 months ago

Hmm, noted. I will KIV this first. But in any case, is the original issue solved?

aleksusklim commented 9 months ago

in any case, the original issue is solved?

I've just built and tested your current experimental branch (getting errors about 'tkinter', but drag-and-dropping a .kcpp file worked!) and at first I thought it was working as intended, aborting previous requests.

But then I disabled SSE token streaming and smashed Generate More + [ABORT] several times during BLAS.

The model did not stop its generation until "Amount to Gen." was reached, and then regenerated its response several times, sending me only the last one:

Please connect to custom endpoint at http://localhost:5001

Input: {…}

Processing Prompt [BLAS] (201 / 201 tokens)
Generation Aborted

Generating (129 / 128 tokens)
ContextLimit: 202/2048, Processing:75.23s (374.3ms/T), Generation:0.02s (23.0ms/T), Total:75.25s (infms/T = 0.00T/s)
Output:  I
Generate: The response could not be sent, maybe connection was terminated?

Input: {…}

Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
ContextLimit: 329/2048, Processing:1.15s (1151.0ms/T), Generation:136.62s (1067.4ms/T), Total:137.77s (1076.4ms/T = 0.93T/s)
Output:  …TEXT…

Generate: The response could not be sent, maybe connection was terminated?

Input: {…}

Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
ContextLimit: 329/2048, Processing:0.93s (928.0ms/T), Generation:129.34s (1010.4ms/T), Total:130.26s (1017.7ms/T = 0.98T/s)
Output: …TEXT…

Generate: The response could not be sent, maybe connection was terminated?

Input: {…}

Deferred Abort for GenKey: KCPP9851

Output:
Generate: The response could not be sent, maybe connection was terminated?

Input: {…}

Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
ContextLimit: 329/2048, Processing:1.05s (1046.0ms/T), Generation:129.71s (1013.4ms/T), Total:130.76s (1021.6ms/T = 0.98T/s)
Output:  …TEXT…

Wait, when you said "I will only track one previous user, not all of them", I read that as "only the previous request can be aborted", and since I will only ever use two (the one stalled in BLAS and the new one), this should work. But it fails when I send 3 or more new requests while BLAS is still processing, even if I naturally abort them one by one?

It comes down to whether it will continue to generate when the client has already disconnected or not. I enabled SSE again and it aborted all those stale generations: Generating… Token streaming was interrupted or aborted!

Was that also the case with the original code when multiuser is in effect? If yes, then your queue changes didn't add much on top of the benefits of SSE. I'm not sure I even have your fixes; I built from https://github.com/LostRuins/koboldcpp/commit/8c22f109fad791bc66853a6ea4d3b4df0132f6e4

LostRuins commented 9 months ago

Note that it only works for up to 2 aborts - multiple aborts by queued users will only save the last attempt.

Yeah, when I said 2 users, I meant 2 requests.

aleksusklim commented 9 months ago

At first I wanted to create a new Issue, but then decided to just ask here:

Can you NOT discard the partially fetched text with streaming/SSE when the connection breaks? So that all the yellow text stays white instead of disappearing, as if I had pressed Abort?

I'm fine with the error being shown, but losing already-generated text feels frustrating…

LostRuins commented 9 months ago

Fixed