aleksusklim opened 9 months ago
The current BLAS batch cannot be aborted once it's started. It will abort once the current batch is done (e.g. 512/2048 will stop at 1024). Have you tried enabling `--multiuser`? It enables a queue which should help with your scenario.
> The current BLAS batch cannot be aborted once it's started.
Yes, I'm aware, and this is the exact reason why I posted this Issue – to be able to retry automatically when the impossible-to-abort batch is done.
> It will abort once the current batch is done
Yes, but until then, Lite shows errors rather than scheduling a resend!
> Have you tried enabling `--multiuser`?
Wait, hold on. Now I tried it, and… is it doing exactly what I described!?
Even inside the BLAS, I can abort and regenerate without any visible errors, and the generation continues automatically!
> It enables a queue which should help with your scenario.
Whoa, I found a drawback: if I change something at the beginning of the history and send it; then abort and send something again, abort and send, abort and send – then koboldcpp finishes the batch and starts to generate something that was already aborted!
Then it prints `Generate: The response could not be sent, maybe connection was terminated?` and generates another request – fully, not stopping at the very first token as it does when really aborting.
And again, again and again (while my Lite still "waits").
I suspect you either:
I think you should enforce a socket check in two moments:
In those two scenarios you should check whether the user's socket is writable, and proceed only if the browser is still actually waiting for the result! Not after you spend time generating something just to throw it away.
I suggest doing this only in multiuser mode, since I see a valid use case where somebody sends a large request and disconnects – expecting koboldcpp to honestly generate everything (at least to fill the history cache, and at most to still print the text to the console).
Are there other drawbacks to enabling multiuser mode? If it retries automatically after an aborted batch (and if you can make it cancel generations that nobody is waiting for), it seems so good that I don't see a reason not to use it always!
There's no way to tell that the client has already abandoned the session without attempting to send data. In fact for synchronous calls there's no indication to the server when a client disconnects except an error when attempting to write to the connection. For SSE streaming this is easier to detect as the connection is actively used.
But here, the issue is also with the Abort command – it's ignored for users currently waiting in the queue, as there is no queue to abort. Only the current active user has the ability to abort their connection. So here's a little illustration of the issue:

- UserA sends request 1. Request 1 starts processing.
- UserB sends request 2. It is added to the queue.
- UserA sends abort 1. The abort is successful and will trigger after the current batch is done (or after the current token is done).
- But if instead UserB sends abort 2, the abort is ignored, as UserB is not the active user (there is currently no way to remove UserB's request from the queue).
- If UserB subsequently abandons the connection, the server will only know when it tries to respond, either on completion or on stream. Then the request will be discarded and the next item will begin processing.
In order for non-started requests to be aborted, a queue of "pending aborts" will need to be logged and tracked. This itself can present another series of problems.
I think I can hack in a solution to keep track of the most recent "pending abort" that should work for a situation with 2 users. I really don't want to queue up aborts for more users than that.
Added to my experimental branch, if you'd like to test. Note that it only works for up to 2 aborts - multiple aborts by queued users will only save the last attempt.
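Though the actual koboldcpp implementation may differ, the "single pending abort" idea can be sketched as a tiny queue wrapper (all names below are hypothetical, for illustration only):

```python
from collections import deque

class RequestQueue:
    """Hypothetical sketch of the 'single pending abort' idea: only the
    most recent abort sent by a queued (non-active) user is remembered,
    so earlier aborts can be silently overwritten."""

    def __init__(self):
        self.queue = deque()
        self.pending_abort = None  # at most one genkey is tracked

    def submit(self, genkey):
        self.queue.append(genkey)

    def abort(self, genkey):
        # A queued user can only overwrite the single pending-abort slot.
        self.pending_abort = genkey

    def next_request(self):
        # Skip a dequeued request if its abort is the one remembered.
        while self.queue:
            genkey = self.queue.popleft()
            if genkey == self.pending_abort:
                self.pending_abort = None
                continue
            return genkey
        return None
```

Submitting A, B, C and then aborting first B and then C leaves only C's abort remembered: B still gets processed, which is exactly the "only the last attempt is saved" limitation described above.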
> There's no way to tell that the client has already abandoned the session without attempting to send data.
Are you sure? What kind of sockets do you have!?
Even for HTTP, you can just send the first line of the response headers, which is always the same. But anyway, a TCP socket should trigger an event on disconnection – maybe you need to listen for it?
It's just the basic python http.server: you pick a port and start a server that listens on it. You can see the current implementation in koboldcpp.py. It's not raw TCP sockets. Events such as `do_POST()` are fired on incoming connections; you get a file-like I/O handle and write your response to it.
Python's http.server exposes the client socket object inside `do_POST` as `self.connection`. Since it is an HTTP/1.0 server, serving each request in a separate TCP connection, we can test for a broken pipe by trying to read from that same socket. There would be no additional data, and `.recv` will block unless we set a zero timeout on the socket.
So the check comes down to calling `.setblocking(False)` and then trying to `.recv()`, which will return an empty bytes object for CLOSED sockets, but throw a `BlockingIOError` for live sockets that have no data in the current buffer (thus wanting to wait for it, but we forbid it from blocking the call).
If for some reason the socket still had some incoming bytes (a buggy browser that sent something after the POST body), `recv` will return a non-empty result which we won't use anyway – so no harm in swallowing it.
Recapping, you can start your handler as

```python
def do_POST(self):
    connection = self.connection
```

Then the variable `connection` represents the state of this client.
To check it, you can use a helper function like this:

```python
def is_connection_alive(connection):
    connection.setblocking(False)
    try:
        return len(connection.recv(64)) > 0
    except:
        return True
```
I tried wrapping your line `gen = asyncio.run(self.handle_request(genparams, api_format, sse_stream_flag))` as

```python
print('ALIVE:', is_connection_alive(connection))
gen = asyncio.run(self.handle_request(genparams, api_format, sse_stream_flag))
print('ALIVE:', is_connection_alive(connection))
```

– And then in the console I can clearly see whether the browser had already disconnected or not! (The first check always returns True, since it executes right when the handler fires; but the second check shows the state after the complete generation.)
So I suppose you should pass the correct `connection` socket (queuing it as needed) to your other functions, up to the point where you know that BLAS has ended – and skip the "generation" phase if the socket turns out to be already dead.
This is not an ideal solution. `recv()` will consume arbitrary data from the buffer, potentially corrupting any incoming data in unpredictable ways. And yes, peek functions exist, but they are very platform-dependent and non-portable.
In any case, the solution to the "Queued Abort" has already been added – so if that solves this issue, there is no need to go messing with the TCP connection states. Unless there is a proper READ ONLY way of determining the connection status?
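For reference, a peek-based check (using `MSG_PEEK`, which reads without consuming) might look like the sketch below; as noted above, peek behaviour is platform-dependent, so treat this as an illustration rather than a portable fix:

```python
import socket

def is_connection_alive_peek(connection):
    # Peek at the receive buffer without consuming it (MSG_PEEK),
    # so any pipelined client data would survive the check.
    connection.setblocking(False)  # note: leaves the socket non-blocking
    try:
        # An empty result from a peek means the peer closed the connection.
        return len(connection.recv(1, socket.MSG_PEEK)) > 0
    except BlockingIOError:
        return True  # no data pending, but the connection is still open
```

Unlike a plain `recv`, a subsequent normal read still sees the peeked bytes.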
> You have no idea what kind of data is awaiting in the incoming buffer.
I certainly know. There is no more data, ever:
All well-behaving user agents send nothing after the POST body, whose size is determined by the Content-Length header, which you have already parsed successfully before receiving the full body data (the lines `content_length = int(self.headers['content-length'])` and `body = self.rfile.read(content_length)`).
If the client erroneously sends something else, it will never be read by your code and WILL get dropped anyway.
You are serving HTTP/1.0 responses, which by protocol must close the TCP connection after sending only one response to the single request. (See here: https://stackoverflow.com/questions/50120102/python-http-server-keep-connection-alive)
> You're overriding the blocking state of the socket.
Okay, this will suffice, as per https://stackoverflow.com/a/34371270: save `old_timeout = socket.gettimeout()` before the check, and call `socket.settimeout(old_timeout)` afterwards, instead of using `setblocking`.
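Folding that save/restore into the earlier helper might look like this (a sketch following the snippet above; the 64-byte read size is as arbitrary as before):

```python
def is_connection_alive(connection):
    # Save the current timeout so the socket's blocking state is restored.
    old_timeout = connection.gettimeout()
    connection.settimeout(0)  # zero timeout == non-blocking, like setblocking(False)
    try:
        # b'' means the peer closed the connection; any data means it is alive.
        return len(connection.recv(64)) > 0
    except BlockingIOError:
        return True  # no data yet: the connection is still open
    finally:
        connection.settimeout(old_timeout)  # put the socket back as we found it
```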
> Unless there is a proper READ ONLY way of determining the connection status?
Well, you can say "What if I were to use HTTP/1.1 with Keep-Alive?" and you would be right… This is the read-only solution, as per https://stackoverflow.com/a/45948462:
```python
def is_connection_alive(connection):
    import select
    r, w, e = select.select([connection], [connection], [], 0)
    return (len(r) == 0) and (len(w) == 1)
```

(`select` is a core module: https://docs.python.org/3/library/select.html#select.select)
Note that `len(w) == 1` is always true; and `len(r) == 0` means "no new data has arrived so far" OR "it is closed and nothing will ever arrive anymore" – and to distinguish these two cases you would have to call `.recv`, which you don't want to…
Here, if the client sends anything after the POST body without waiting for a response, the connection will be wrongly identified as broken (since the socket becomes immediately readable). But even for HTTP/1.1: can a user agent send two requests without waiting for the first answer? They can: https://stackoverflow.com/questions/21696733/issuing-multiple-requests-before-getting-response But this is not the case for POST: "Non-idempotent requests such as POST should not be pipelined." (https://en.wikipedia.org/wiki/HTTP_pipelining#Motivation_and_limitations)
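To make that caveat concrete, here is the same select-based check exercised on a local `socketpair` (my stand-in for the real client socket; the sleeps just let the events propagate):

```python
import select
import socket
import time

def is_connection_alive(connection):
    # Same read-only check as above: a live idle socket is writable but
    # not readable; EOF (peer closed) or pending data makes it readable.
    r, w, e = select.select([connection], [connection], [], 0)
    return (len(r) == 0) and (len(w) == 1)

a, b = socket.socketpair()
print(is_connection_alive(a))   # True: peer open, no pending data

b.send(b"!")                    # a "buggy browser" sending extra bytes
time.sleep(0.1)
print(is_connection_alive(a))   # False: pending data looks like a broken pipe

a.recv(1)                       # drain the stray byte
b.close()
time.sleep(0.1)
print(is_connection_alive(a))   # False: the peer really closed
```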
I think you are safe to use the above implementation (just move the `import` to the top of your source code); I've tested it on Windows with all three modes of token streaming.
Hmm, noted. I will KIV this first. But in any case, the original issue is solved?
> in any case, the original issue is solved?
I've just built and tested your current experimental branch (getting errors for 'tkinter', but drag-and-dropping a .kcpp file worked!) and at first I thought it was working as intended, aborting previous requests.
But then I disabled SSE token streaming and smashed `Generate More` + `[ABORT]` several times during BLAS.
The model did not stop its generation until "Amount to Gen." was reached, and then regenerated its response several times, sending me only the last one:
```
Please connect to custom endpoint at http://localhost:5001
Input: {…}
Processing Prompt [BLAS] (201 / 201 tokens)
Generation Aborted
Generating (129 / 128 tokens)
ContextLimit: 202/2048, Processing:75.23s (374.3ms/T), Generation:0.02s (23.0ms/T), Total:75.25s (infms/T = 0.00T/s)
Output: I
Generate: The response could not be sent, maybe connection was terminated?
Input: {…}
Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
ContextLimit: 329/2048, Processing:1.15s (1151.0ms/T), Generation:136.62s (1067.4ms/T), Total:137.77s (1076.4ms/T = 0.93T/s)
Output: …TEXT…
Generate: The response could not be sent, maybe connection was terminated?
Input: {…}
Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
ContextLimit: 329/2048, Processing:0.93s (928.0ms/T), Generation:129.34s (1010.4ms/T), Total:130.26s (1017.7ms/T = 0.98T/s)
Output: …TEXT…
Generate: The response could not be sent, maybe connection was terminated?
Input: {…}
Deferred Abort for GenKey: KCPP9851
Output:
Generate: The response could not be sent, maybe connection was terminated?
Input: {…}
Processing Prompt (1 / 1 tokens)
Generating (128 / 128 tokens)
ContextLimit: 329/2048, Processing:1.05s (1046.0ms/T), Generation:129.71s (1013.4ms/T), Total:130.76s (1021.6ms/T = 0.98T/s)
Output: …TEXT…
```
Wait, when you said "I will only track one previous user, not all of them", I read that as "only the previous request can be aborted", and since I will only ever use two (the one stalled in BLAS and the new one), this should work. But it fails when I send 3 or more new requests while BLAS is still processing, even if I naturally abort them one by one!
It comes down to this: will it continue to generate when the client has already disconnected, or not?
I enabled SSE again, and it aborted all those stale generations:

```
Generating
Token streaming was interrupted or aborted!
```
Was that the case with the original code too, when multiuser is in effect? If yes, then your queue changes didn't add much to the benefits of SSE. I'm not sure that I even have your fixes; I built from https://github.com/LostRuins/koboldcpp/commit/8c22f109fad791bc66853a6ea4d3b4df0132f6e4
> Note that it only works for up to 2 aborts - multiple aborts by queued users will only save the last attempt.
Yeah, when I said 2 users, I meant 2 requests.
At first I wanted to create a new Issue, but then decided to just tell you here:
Can you NOT discard the partially fetched text with streaming/SSE when the connection breaks? So that all yellow text stays white instead of disappearing, as if I had pressed Abort.
I'm fine with the error being shown, but losing already generated text feels frustrating…
Fixed
Use case:

1. I click `Submit`; the `[ABORT]` link appears.
2. I click `[ABORT]` to cancel.
3. I click `Generate More` (or, alternatively, I copy the last text back to the input box, hit `Back`, then fix it and click `Submit` again).
4. Lite shows an `Error Encountered` popup telling something about `Server is busy; please try again later.`; I press `OK`.
5. It asks `Send Abort Command? Attempt to abort existing request?` – and whether I answer `Yes` or `No`, the BLAS step cannot be aborted.

This is happening to me so often that now I came up with this idea: show a popup titled `Background generation in progress` with the text `The koboldcpp server is busy generating its response to a previous request. You can try to abort it now.` and three buttons. The buttons `Abort` and `Cancel` work as your `Yes` and `No` do currently. But `Abort and retry` closes the popup, sends the abort, and keeps polling the server, showing the `[ABORT]` link under it; clicking `[ABORT]` while the real generation isn't started yet cancels the polling loop, returning the UI to the idle state just as this link does currently.

Note that for the retry you need to define what should happen if the server responds that it is busy again: should we abort once more and restart polling, or error out? If this "generally should not be possible", then the error is better (it would indicate that somebody else sent a request just before us).
On the other hand, if the polling is unreliable, you can forcibly send abort+generate each time until the user cancels or the server goes offline. I don't know anything about Horde – I never used it, so I cannot tell how this would affect it; but I assume you either allow aborting requests there (so the auto-repeat won't make things worse), or you don't.
What do you think?