Hey, thanks for reaching out! You are doing everything right; unfortunately, this is the "streaming" that Google provides. My first guess was that, for safety reasons, Google needs to verify the output is safe before delivering it to you, which is why it hangs and then floods you with all the events at once (just a hypothesis; I have no evidence that this is actually the case).
I tried disabling the safety stuff to see if it would get better, but no luck:
client.stream_generate_content({
  contents: { role: 'user', parts: { text: 'hi!' } },
  safetySettings: [
    {
      category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_HATE_SPEECH',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_HARASSMENT',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
      threshold: 'BLOCK_NONE'
    }
  ]
})
I wrote more about this here: LBPE Score: A New Perspective for Evaluating AI LLMs
And compared Gemini "streaming" with other providers:
"Gemini Pro’s “streaming” is mostly waiting, then a burst of activity — it’s not genuinely streaming."
From my research and experiments, we are doing everything right according to the documentation (requesting with ?alt=sse), and this is simply how it behaves. I would be happy to find out we are missing something, but so far, it sounds like it is what it is.
Hey @gbaptista,
Thanks for your detailed response. I tried implementing my own client with a streaming response using a couple of tools like Faraday and Net::HTTP, and I got the same result. However, if you request Gemini with curl, you actually get streaming-like behavior:
curl -N -X POST -H "Content-Type: application/json" -d '{"contents": [{"role": "user", "parts": [{"text": "tell me short story"}]}]}' https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:streamGenerateContent\?key\=API_KEY 2> /dev/null | grep "text"
So I was wondering whether the server behaves differently when you make the request with curl, or whether something within Ruby buffers the response.
Oh, that's super helpful. I will try to investigate that; maybe there are some internals in Faraday that we can tweak to make it work that way! Also, if you find something, please share it as well.
I also tried prompting gemini-pro with Python using the google-cloud-aiplatform package, and it seems to support streaming as well.
https://github.com/gbaptista/gemini-ai/assets/695947/feed2506-f669-441e-9e6c-881b710e2a94
Ok, found it. Add this gem:
gem 'faraday-typhoeus', '~> 1.1'
Add this before your code:
require 'faraday'
require 'faraday/typhoeus'
Faraday.default_adapter = :typhoeus
Streaming should work now.
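For reference, an end-to-end sketch of the workaround (the adapter swap is the fix above; the client setup and block parameters are assumptions based on the snippets elsewhere in this thread):

require 'faraday'
require 'faraday/typhoeus'
require 'gemini-ai'

# Typhoeus (libcurl) hands over response chunks as they arrive, instead of
# buffering the whole body the way the default adapter appears to here.
Faraday.default_adapter = :typhoeus

client = Gemini.new(
  credentials: {
    service: 'generative-language-api',
    api_key: ENV['GOOGLE_API_KEY']
  },
  options: { model: 'gemini-pro', server_sent_events: true }
)

client.stream_generate_content(
  { contents: { role: 'user', parts: { text: 'hi!' } } }
) do |event, parsed, raw|
  puts event # should now fire per chunk rather than all at once
end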
Probably related to this:
I'm going to give some thought to how to include this in the gem. Typhoeus was the first adapter that worked, but I need to consider whether we want to ship a specific alternative default adapter, and if so, which one would be the best choice.
Oh, cool! I was also able to make it work with the typhoeus gem directly, using Typhoeus::Request.new. It's great that this is possible just by changing Faraday's adapter.
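For anyone curious, a minimal sketch of that direct Typhoeus approach (the URL and payload mirror the curl example above; the key lookup is a placeholder):

require 'typhoeus'
require 'json'

request = Typhoeus::Request.new(
  "https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:streamGenerateContent?key=#{ENV['API_KEY']}",
  method: :post,
  headers: { 'Content-Type' => 'application/json' },
  body: { contents: [{ role: 'user', parts: [{ text: 'tell me a short story' }] }] }.to_json
)

# on_body is invoked for every chunk as it arrives, which is what makes
# the streaming behavior visible.
request.on_body do |chunk|
  print chunk
end

request.run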
IMO the gemini-ai gem could just swap the default adapter to :typhoeus for you when a streaming response is requested. Something like:
response = Faraday.new(request: @request_options) do |faraday|
  faraday.response :raise_error
  # Inside the connection builder, the adapter is set with #adapter:
  faraday.adapter :typhoeus if server_sent_events_enabled
end.post do |request|
  ...
Another challenge with the streaming response from gemini-pro is that it arrives as a chunked JSON string. In the first chunk you get something like:

[{
  "candidates": [...]
}

and the next one can be:

,
{
  "candidates": [...]
}

I've even seen cases where the candidates object was split across two chunks. So, by default, these chunks are likely not parsable JSON on their own. One would need to analyze and reassemble the strings received from the server before a chunk can be parsed and the callback block invoked.
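One way to deal with that (a minimal sketch, not the gem's actual implementation; it tracks brace depth to detect complete objects and, for simplicity, ignores braces that may appear inside JSON string values):

require 'json'

# Accumulates raw chunks and yields each complete top-level JSON object
# from the streamed array as soon as its closing brace arrives.
class StreamedJSONReassembler
  def initialize(&callback)
    @buffer = +''
    @depth = 0
    @callback = callback
  end

  def receive(chunk)
    chunk.each_char do |char|
      @depth += 1 if char == '{'
      @buffer << char if @depth.positive?
      next unless char == '}'

      @depth -= 1
      if @depth.zero?
        @callback.call(JSON.parse(@buffer))
        @buffer.clear
      end
    end
  end
end

Feeding each network chunk to receive would then invoke the callback once per complete candidates object, regardless of where the chunk boundaries fall.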
@alchaplinsky That's true, but fortunately, I already solved this problem (partial JSON responses) in another gem, so I know how to deal with it. Let's get it done; I will prepare a PR.
@alchaplinsky done:
Great job @gbaptista ! Thanks for resolving this.
Hey @gbaptista, good job on the integration with Gemini models! I was trying this gem out for a project of mine and couldn't get streaming to actually work. I tried connecting to both vertex-ai-api and generative-language-api, as well as passing the server_sent_events: true option both to the client initializer and to the stream_generate_content method; sketches of initializing a client and running a request follow.
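A hypothetical reconstruction of that setup (the original snippets are not shown above; the credential values and block parameters are assumptions based on the options mentioned in this thread):

Initializing a client

require 'gemini-ai'

client = Gemini.new(
  credentials: {
    service: 'generative-language-api', # also tried 'vertex-ai-api'
    api_key: ENV['GOOGLE_API_KEY']
  },
  options: { model: 'gemini-pro', server_sent_events: true }
)

Running a request

client.stream_generate_content(
  { contents: { role: 'user', parts: { text: 'tell me a short story' } } }
) do |event, parsed, raw|
  puts event
end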
It just spits out all the puts and returns the results all at once. I thought there was some output buffering happening that made everything print together, but the block does get called as data comes in, so I tried checking when the block actually runs:
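A hypothetical reconstruction of that check (the original snippet is not shown above): print a timestamp from the block as each chunk arrives.

client.stream_generate_content(
  { contents: { role: 'user', parts: { text: 'tell me a long story' } } }
) do |event, parsed, raw|
  puts Time.now # all of these printed within the same second
end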
The output showed that, even though the request took several seconds while the model generated a longish story, all the blocks were executed within the same second, when the final result was returned from the stream_generate_content method. Am I missing something that's needed to make streaming work? Or is this just the way it works: you get the response chunked, but all the chunks arrive at the same time?