gbaptista / gemini-ai

A Ruby Gem for interacting with Gemini through Vertex AI, Generative Language API, or AI Studio, Google's generative AI services.
https://rubygems.org/gems/gemini-ai
MIT License

Streaming is not working #11

Closed · alchaplinsky closed this 8 months ago

alchaplinsky commented 8 months ago

Hey @gbaptista, good job on the integration with Gemini models! I was trying this gem out for a project of mine and I couldn't get streaming to actually work. I tried connecting to both vertex-ai-api and generative-language-api, as well as passing the server_sent_events: true option to both the client initializer and the stream_generate_content method.

Initializing a client

client = Gemini.new(
  credentials: {
    service: 'vertex-ai-api',
    file_path: 'google-credentials.json',
    region: 'us-east4'
  },
  options: { model: 'gemini-pro', server_sent_events: true }
)

Running a request

client.stream_generate_content({ contents: { role: 'user', parts: { text: 'generate a short story...' } } }, server_sent_events: true) do |event|
  puts event
end

It just spits out all the puts output and returns the results all at once. I thought there was some output buffering happening that made everything appear together, since the block is supposed to get called as data comes in, so I tried this:

client.stream_generate_content({ contents: { role: 'user', parts: { text: 'generate a short story...' } } }, server_sent_events: true) do |event|
  puts Time.now
end

and the output was:

2024-01-22 11:56:19.639536 +0700
2024-01-22 11:56:19.639734 +0700
2024-01-22 11:56:19.640454 +0700
2024-01-22 11:56:19.640596 +0700
2024-01-22 11:56:19.640801 +0700
2024-01-22 11:56:19.640929 +0700
2024-01-22 11:56:19.641091 +0700
2024-01-22 11:56:19.641239 +0700
2024-01-22 11:56:19.641432 +0700
2024-01-22 11:56:19.641593 +0700

So, even though the request took several seconds while the model generated a longish story, all the block invocations happened within the same second, when the final result was returned from the stream_generate_content method.

Am I missing something that's needed to make streaming work? Or is this just how it works: you get the response chunked, but all the chunks arrive at the same time?

gbaptista commented 8 months ago

Hey, thanks for reaching out! You are doing everything right; unfortunately, this is the "streaming" that Google provides. My first guess was that, for safety reasons, Google needs to ensure the output is safe before delivering it to you; that's why it hangs and then floods you with all the events at once (just a hypothesis, I have no evidence that this is a fact).

I tried disabling the safety stuff to see if it would get better, but no luck:

client.stream_generate_content({
  contents: { role: 'user', parts: { text: 'hi!' } },
  safetySettings: [
    {
      category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_HATE_SPEECH',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_HARASSMENT',
      threshold: 'BLOCK_NONE'
    },
    {
      category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
      threshold: 'BLOCK_NONE'
    }
  ]
})

I wrote more about this here: LBPE Score: A New Perspective for Evaluating AI LLMs

And compared Gemini "streaming" with other providers:

[animated comparison of streaming output across providers]

"Gemini Pro’s “streaming” is mostly waiting, then a burst of activity — it’s not genuinely streaming."

From my research and experiments, we are doing everything right according to the documentation (?alt=sse), and this is how it works. I would be happy to find out we are missing something, but so far, it sounds like it is what it is.

alchaplinsky commented 8 months ago

Hey @gbaptista, thanks for your detailed response. I tried to implement my own client with a streaming response using a couple of tools (Faraday, Net::HTTP) and got the same result. However, if you request Gemini with curl, you actually get streaming-like behavior:

 curl -N -X POST -H "Content-Type: application/json" -d '{"contents": [{"role": "user", "parts": [{"text": "tell me short story"}]}]}' https://generativelanguage.googleapis.com/v1beta/models/gemini-pro:streamGenerateContent\?key\=API_KEY 2> /dev/null | grep "text"

So I was wondering whether the server behaves differently when you make a request with curl, or whether something within Ruby buffers the response.
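For reference, here is a minimal sketch of that kind of manual check in plain Ruby, using Net::HTTP against the same endpoint as the curl command above (the GEMINI_API_KEY environment variable and the prompt are illustrative assumptions):

require 'net/http'
require 'json'
require 'uri'

# Sketch: stream directly from the Generative Language API with Net::HTTP,
# timestamping each raw chunk to see whether Ruby itself buffers the body.
uri = URI('https://generativelanguage.googleapis.com/v1beta/models/' \
          "gemini-pro:streamGenerateContent?key=#{ENV['GEMINI_API_KEY']}")

payload = { contents: [{ role: 'user', parts: [{ text: 'tell me a short story' }] }] }

Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
  request.body = JSON.generate(payload)

  http.request(request) do |response|
    # read_body with a block yields fragments as they come off the socket.
    response.read_body do |chunk|
      puts "#{Time.now} received #{chunk.bytesize} bytes"
    end
  end
end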

gbaptista commented 8 months ago

Oh, that's super helpful. I will try to investigate that; maybe there are some internals in Faraday that we can tweak to make it work that way! Also, if you find something, please share it as well.

alchaplinsky commented 8 months ago

I also tried prompting gemini-pro from Python using the google-cloud-aiplatform package, and it seems to support streaming as well.

https://github.com/gbaptista/gemini-ai/assets/695947/feed2506-f669-441e-9e6c-881b710e2a94

gbaptista commented 8 months ago

Ok, found it. Add this gem:

gem 'faraday-typhoeus', '~> 1.1'

Add this before your code:

require 'faraday'
require 'faraday/typhoeus'

Faraday.default_adapter = :typhoeus

Streaming should work now.
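Putting it together with the client setup from the top of this issue, a sketch of the full flow:

require 'faraday'
require 'faraday/typhoeus'
require 'gemini-ai'

# Switch Faraday's default adapter to Typhoeus so chunks are
# delivered as they arrive instead of after the full response.
Faraday.default_adapter = :typhoeus

client = Gemini.new(
  credentials: {
    service: 'vertex-ai-api',
    file_path: 'google-credentials.json',
    region: 'us-east4'
  },
  options: { model: 'gemini-pro', server_sent_events: true }
)

client.stream_generate_content(
  { contents: { role: 'user', parts: { text: 'generate a short story...' } } }
) do |event|
  # With the Typhoeus adapter, these timestamps should now spread
  # out over the duration of the generation.
  puts "#{Time.now} #{event}"
end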

Probably related to this:

I'm going to give some thought to how to include this in the gem. Typhoeus was the first adapter that worked, but I need to consider whether we want to ship a specific alternative default adapter, and if so, which adapter would be the best choice.

alchaplinsky commented 8 months ago

Oh, cool! I was also able to make it work with the typhoeus gem directly, using Typhoeus::Request.new. It's great that this is possible just by changing Faraday's adapter. IMO the gemini-ai gem could just swap the adapter to typhoeus for you when a streaming response is requested. Something like:

response = Faraday.new(request: @request_options) do |faraday|
  faraday.response :raise_error
  # Use the Typhoeus adapter only when streaming is requested
  faraday.adapter :typhoeus if server_sent_events_enabled
end.post do |request|
  # ...
alchaplinsky commented 8 months ago

Another challenge with the streaming response from gemini-pro is that it is a chunked JSON string. So in the first chunk you get something like:

[{
   "candidates": [...]
}

then the next one can be

,
{
   "candidates": [...]
}

I've even seen a case where a candidates object was split across two chunks. So, by default, these chunks are likely not parsable JSON on their own. One would need to buffer and repair the string received from the server before it can be parsed and the callback block invoked, as sketched below.
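A minimal sketch of one way to do that buffering (not the gem's actual implementation; ChunkAssembler and its scanning logic are illustrative):

require 'json'

# Illustrative only: accumulate raw chunks and emit each complete
# top-level JSON object from the streamed array as soon as it closes.
class ChunkAssembler
  def initialize(&callback)
    @buffer = +''
    @callback = callback
  end

  # Feed each raw chunk as it arrives from the socket.
  def <<(chunk)
    @buffer << chunk
    emit_complete_objects
  end

  private

  def emit_complete_objects
    loop do
      # Drop the array framing ("[", ",", "]") between objects.
      @buffer.sub!(/\A[\s\[,\]]+/, '')
      object, consumed = extract_object(@buffer)
      break unless object

      @callback.call(object)
      @buffer = @buffer[consumed..]
    end
  end

  # Find the first balanced {...} span, honoring strings and escapes;
  # returns nil when the buffer holds only a partial object so far.
  def extract_object(text)
    depth = 0
    in_string = false
    escaped = false

    text.each_char.with_index do |char, index|
      if in_string
        if escaped then escaped = false
        elsif char == '\\' then escaped = true
        elsif char == '"' then in_string = false
        end
        next
      end

      case char
      when '"' then in_string = true
      when '{' then depth += 1
      when '}'
        depth -= 1
        return [JSON.parse(text[0..index]), index + 1] if depth.zero?
      end
    end

    nil
  end
end

# Usage: assembler = ChunkAssembler.new { |event| puts event }
#        assembler << chunk  # for each chunk read from the response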

gbaptista commented 8 months ago

@alchaplinsky That's true, but fortunately, I already solved this problem (partial JSON responses) in another gem, so I know how to deal with it. Let's get it done; I will prepare a PR.

gbaptista commented 8 months ago

@alchaplinsky done:

alchaplinsky commented 8 months ago

Great job, @gbaptista! Thanks for resolving this.