An initial implementation was provided in a7399bfa240fc32c007260f81db7c7140a739491
Okay I get to implement async streaming in the modified vllm file then :)
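Roughly what I have in mind, just as a sketch, the `generate_stream` stub below stands in for whatever streaming interface the modified vllm file ends up exposing, and the endpoint path is a placeholder:

```python
# Sketch only: the fake `generate_stream` below stands in for whatever
# streaming interface the modified vllm file ends up exposing.
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    max_tokens: int = 64

async def generate_stream(prompt: str, max_tokens: int):
    # Placeholder engine: yields dummy tokens one at a time.
    for i in range(max_tokens):
        await asyncio.sleep(0.01)
        yield f"tok{i} "

async def sse_events(prompt: str, max_tokens: int):
    # Forward each partial completion to the client as soon as it is
    # produced, instead of waiting for the full response.
    async for token in generate_stream(prompt, max_tokens):
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

@app.post("/v1/completions/stream")
async def stream_completion(req: CompletionRequest):
    return StreamingResponse(
        sse_events(req.prompt, req.max_tokens),
        media_type="text/event-stream",
    )
```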
@Ar4l, the current implementation has not been validated yet (it will be with the completion of #18), so I would wait a bit before you actually start adding code to that part of the application; there is a very high likelihood that it will have to be changed/refactored.
But it would be pretty cool to be able to stream the completions to the user, we could even do some sort of A/B test to see the impact streaming has on the user's likelihood to accept completions 🤔
Tested the implementation in 5a9c7163ec80048cc381f88f155fd2055acc4026 and ensured that it is working as intended.
Thanks @RebelOfDeath, I like the idea of an A/B study. I think it would be especially valuable for the case where users manually invoke the completions.
However, several papers report the general finding that manual key binds result in poor tool adoption/usage.
On the other hand, streaming them continuously whenever the user stops typing is also probably very distracting.
My study on filtering out those moments might be a solution to this, but my instinct tells me that false positives carry the added weight of increased distraction in the streaming setting. We probably need a bit more work on filtering out any suggestions that could distract the user.
Anyway, the point is that there are questions here we could address with an A/B user study. E.g., maybe we only stream responses when the filter is highly confident that what the LLM will generate is useful, and we find the appropriate threshold via the A/B test.
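To make the threshold idea concrete, something like the sketch below; the names and the 0.9 value are made up, purely illustrative:

```python
# Purely illustrative: `filter_confidence` is whatever score the filter
# produces, and 0.9 is an arbitrary starting threshold to tune via the A/B test.
STREAM_CONFIDENCE_THRESHOLD = 0.9  # varied per A/B group

def should_stream(filter_confidence: float, user_invoked: bool) -> bool:
    # Always honour an explicit manual invocation; otherwise only stream
    # when the filter is highly confident the completion will be useful.
    if user_invoked:
        return True
    return filter_confidence >= STREAM_CONFIDENCE_THRESHOLD
```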
Introduction of a WebSocket-based serving strategy, which would reduce the overall overhead of the requests being made. Given that we will soon be shifting focus to developing the actual plugins for CoCo, I think it would be nice to have something like this in place; it would (at least in my opinion) be a major improvement over prior versions.
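A rough sketch of what this could look like; the endpoint path and JSON message format are placeholders, not the final protocol, and the completion call is stubbed out:

```python
# Sketch only: the completion call is stubbed out and the JSON message
# format is a placeholder, not the final protocol.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/completions")
async def completions_ws(ws: WebSocket):
    await ws.accept()
    try:
        while True:
            # One long-lived connection per editor session: each completion
            # request is a small JSON message rather than a fresh HTTP
            # request, so we avoid repeated connection/header overhead.
            request = await ws.receive_json()
            prompt = request.get("prompt", "")
            completion = f"<completion for {prompt!r}>"  # stand-in for the model call
            await ws.send_json({"completion": completion})
    except WebSocketDisconnect:
        pass
```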