Hey, thanks for reaching out! You are doing everything right; unfortunately, this is the "streaming" that Google provides. My first guess was that, for safety reasons, Google needs to verify the output is safe before delivering it to you, which is why it hangs and then floods you with all the events at once (just a hypothesis; I have no evidence that this is actually the case).
I tried disabling the safety stuff to see if it would get better, but no luck:
client.stream_generate_content({
  contents: { role: 'user', parts: { text: 'hi!' } },
  safetySettings: [
    {
      category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_HATE_SPEECH',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_HARASSMENT',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
      threshold: 'BLOCK_NONE'
    }
  ]
})
I wrote more about this here: LBPE Score: A New Perspective for Evaluating AI LLMs
And compared Gemini "streaming" with other providers:
"Gemini Pro’s “streaming” is mostly waiting, then a burst of activity — it’s not genuinely streaming."
From my research and experiments, we are doing everything right according to the documentation (requesting with ?alt=sse), and this is simply how it behaves. I would be happy to find out we are missing something, but so far, it sounds like it is what it is.
Hey @gbaptista,
Thanks for your detailed response. I tried implementing my own client with a streaming response using a couple of tools like Faraday and Net::HTTP, and I got the same result. However, if you request Gemini with curl, you actually get streaming-like behavior:
curl -N -X POST -H "Content-Type: application/json" -d '{"contents": [{"role": "user", "parts": [{"text": "tell me short story"}]}]}' https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:streamGenerateContent\?key\=API_KEY 2> /dev/null | grep "text"
So I was wondering whether the server behaves differently when you make the request with curl, or whether something within Ruby buffers the response.
Oh, that's super helpful. I will try to investigate that; maybe there are some internals in Faraday that we can tweak to make it work that way! Also, if you find something, please share it as well.
I also tried prompting gemini-pro with Python using the google-cloud-aiplatform package, and it seems to support streaming as well.
https://github.com/gbaptista/gemini-ai/assets/695947/feed2506-f669-441e-9e6c-881b710e2a94
Ok, found it. Add this gem:
gem 'faraday-typhoeus', '~> 1.1'
Add this before your code:
require 'faraday'
require 'faraday/typhoeus'
Faraday.default_adapter = :typhoeus
Streaming should work now.
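For reference, an end-to-end sketch of the workaround (the adapter swap is the fix above; the client setup and block parameters are assumptions based on the snippets elsewhere in this thread):

require 'faraday'
require 'faraday/typhoeus'
require 'gemini-ai'

# Typhoeus (libcurl) hands over response chunks as they arrive, instead of
# buffering the whole body the way the default adapter appears to here.
Faraday.default_adapter = :typhoeus

client = Gemini.new(
  credentials: {
    service: 'generative-language-api',
    api_key: ENV['GOOGLE_API_KEY']
  },
  options: { model: 'gemini-pro', server_sent_events: true }
)

client.stream_generate_content(
  { contents: { role: 'user', parts: { text: 'hi!' } } }
) do |event, parsed, raw|
  puts event # should now fire per chunk rather than all at once
end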
Probably related to this:
I'm going to give some thought to how to include this in the gem. Typhoeus was the first adapter that worked, but I need to consider whether we want to ship a specific alternative default adapter, and if so, which one would be the best choice.
Oh, cool! I was also able to make it work with the typhoeus gem directly, using Typhoeus::Request.new. It's great that this is possible just by changing Faraday's adapter.
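For anyone curious, a minimal sketch of that direct Typhoeus approach (the URL and payload mirror the curl example above; the key lookup is a placeholder):

require 'typhoeus'
require 'json'

request = Typhoeus::Request.new(
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:streamGenerateContent?key=#{ENV['API_KEY']}",
  method: :post,
  headers: { 'Content-Type' => 'application/json' },
  body: { contents: [{ role: 'user', parts: [{ text: 'tell me a short story' }] }] }.to_json
)

# on_body is invoked for every chunk as it arrives, which is what makes
# the streaming behavior visible.
request.on_body do |chunk|
  print chunk
end

request.run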
IMO the gemini-ai gem could just swap the default adapter to :typhoeus for you when a streaming response is requested. Something like:
response = Faraday.new(request: @request_options) do |faraday|
  faraday.response :raise_error
  # Inside the connection builder, the adapter is set with #adapter:
  faraday.adapter :typhoeus if server_sent_events_enabled
end.post do |request|
  ...
Another challenge with the streaming response from gemini-pro is that it arrives as a chunked JSON string. In the first chunk you get something like:

[{
  "candidates": [...]
}

and the next one can be:

,
{
  "candidates": [...]
}

I've even seen cases where the candidates object was split across two chunks. So, by default, these chunks are likely not parsable JSON on their own. One would need to analyze and reassemble the strings received from the server before a chunk can be parsed and the callback block invoked.
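One way to deal with that (a minimal sketch, not the gem's actual implementation; it tracks brace depth to detect complete objects and, for simplicity, ignores braces that may appear inside JSON string values):

require 'json'

# Accumulates raw chunks and yields each complete top-level JSON object
# from the streamed array as soon as its closing brace arrives.
class StreamedJSONReassembler
  def initialize(&callback)
    @buffer = +''
    @depth = 0
    @callback = callback
  end

  def receive(chunk)
    chunk.each_char do |char|
      @depth += 1 if char == '{'
      @buffer << char if @depth.positive?
      next unless char == '}'

      @depth -= 1
      if @depth.zero?
        @callback.call(JSON.parse(@buffer))
        @buffer.clear
      end
    end
  end
end

Feeding each network chunk to receive would then invoke the callback once per complete candidates object, regardless of where the chunk boundaries fall.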
@alchaplinsky That's true, but fortunately, I already solved this problem (partial JSON responses) in another gem, so I know how to deal with it. Let's get it done; I will prepare a PR.
@alchaplinsky done:
Great job @gbaptista ! Thanks for resolving this.
Hey @gbaptista, good job on the integration with Gemini models! I was trying this gem out for a project of mine and couldn't get streaming to actually work. I tried connecting to both vertex-ai-api and generative-language-api, as well as passing the server_sent_events: true option both to the client initializer and to the stream_generate_content method; sketches of initializing a client and running a request follow.
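A hypothetical reconstruction of that setup (the original snippets are not shown above; the credential values and block parameters are assumptions based on the options mentioned in this thread):

Initializing a client

require 'gemini-ai'

client = Gemini.new(
  credentials: {
    service: 'generative-language-api', # also tried 'vertex-ai-api'
    api_key: ENV['GOOGLE_API_KEY']
  },
  options: { model: 'gemini-pro', server_sent_events: true }
)

Running a request

client.stream_generate_content(
  { contents: { role: 'user', parts: { text: 'tell me a short story' } } }
) do |event, parsed, raw|
  puts event
end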
It just spits out all the puts and returns the results all at once. I thought there was some output buffering happening that made everything print together, but the block does get called as data comes in, so I tried checking when the block actually runs:
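A hypothetical reconstruction of that check (the original snippet is not shown above): print a timestamp from the block as each chunk arrives.

client.stream_generate_content(
  { contents: { role: 'user', parts: { text: 'tell me a long story' } } }
) do |event, parsed, raw|
  puts Time.now # all of these printed within the same second
end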
The output showed that, even though the request took several seconds while the model generated a longish story, all the blocks were executed within the same second, when the final result was returned from the stream_generate_content method. Am I missing something that's needed to make streaming work? Or is this just the way it works: you get the response chunked, but all the chunks arrive at the same time?