bien-phillip opened 7 months ago
Hi. I have a server with two ollama instances, and I use streaming mode from my clients. My client uses the /generate endpoint and it works. I see you are using the /chat endpoint instead. The only endpoint that goes through the dual queues is /generate, so I need to take a look at that; I think that's why it isn't working for you. I'll check this out and add the /chat endpoint to the queue management.
Thanks for informing me
OK, I just went through their docs and added the /chat endpoint to the list of queued endpoints.
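The change described above can be sketched as follows. This is a hypothetical illustration, not the actual ollama_proxy_server internals: the names `QUEUED_ENDPOINTS`, `request_queue`, and `enqueue_if_needed` are all assumptions made for the example.

```python
# Hypothetical sketch: route both completion endpoints through one shared queue.
import queue

# Illustrative endpoint list; only these paths are serialized through the queue.
QUEUED_ENDPOINTS = {"/api/generate", "/api/chat"}  # /api/chat newly added

request_queue = queue.Queue()

def enqueue_if_needed(path, payload):
    """Put requests to queued endpoints on the shared queue; pass others through."""
    if path in QUEUED_ENDPOINTS:
        request_queue.put((path, payload))
        return True
    return False
```

With this, a /api/chat request is serialized exactly like a /api/generate one, while unrelated paths (tags, model listing, ...) bypass the queue.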
Thank you for the response. I will check the /chat endpoint again. Also, I don't think the streaming issue is caused by this code; I'm not sure, but it may be a LangChain issue. I'll test that as well.
Even using /generate, the output is broken into many responses, but all of the responses arrive together. Streaming does not work.
> Even using /generate, the output is broken into many responses, but all of the responses arrive together. Streaming does not work.

What client are you using? I just tested using lollms.
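The symptom described above (many chunks, but delivered all at once) is what happens when a proxy collects the upstream stream before responding instead of forwarding chunks as they arrive. A minimal sketch of the difference, with a stand-in generator in place of Ollama's real NDJSON stream (the function names and sample payloads are illustrative only):

```python
# Stand-in for Ollama's NDJSON streaming response: one JSON object per line.
def fake_upstream():
    yield b'{"response": "Hel"}\n'
    yield b'{"response": "lo"}\n'
    yield b'{"response": "!", "done": true}\n'

def buffered(upstream):
    # What the bug looks like: everything is joined and sent as one blob,
    # so the client sees all "chunks" together at the end.
    return b"".join(upstream)

def streamed(upstream):
    # The fix: forward each chunk to the client as soon as it arrives.
    for chunk in upstream:
        yield chunk
```

A client reading `streamed(...)` sees three incremental pieces; a client reading `buffered(...)` sees a single response containing the same bytes, which matches the reported behavior.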
Hi ParisNeo.
First of all, thank you for sharing this awesome project. I just checked it for parallel requests with Ollama; I found this repo from an Ollama issue.
I've run two Ollama instances behind ollama_proxy_server, and I found two problems. One is that it does not work with streaming responses: it just sends the last data. The other is that qsize does not report the right value. I attached a screenshot.
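One possible reason a displayed queue size looks wrong, sketched below as a guess rather than a diagnosis of this project's code: `queue.Queue.qsize()` only counts items still *waiting*, not the request currently being processed, so the number reads 0 while a worker is busy unless in-flight work is tracked separately (the `in_flight` counter here is a hypothetical fix, not part of ollama_proxy_server).

```python
import queue

q = queue.Queue()
q.put("request-1")
assert q.qsize() == 1   # one request waiting

item = q.get()          # a worker picks it up...
assert q.qsize() == 0   # ...and qsize already reads 0 while it is still running

# Tracking in-flight requests explicitly gives a truthful load number:
in_flight = 1           # incremented when a worker takes a request
load = q.qsize() + in_flight
```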
This is just a first try. I'll test it more and give more feedback.
Thanks.