GoogleCloudPlatform / generative-ai

Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI
https://cloud.google.com/vertex-ai/docs/generative-ai/learn/overview
Apache License 2.0
7.54k stars, 2.1k forks

[Bug]: Barely multimodal and useless for the most part in PDF extraction #1278

Closed by sidoncloud 4 weeks ago

sidoncloud commented 4 weeks ago

File Name

https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb

What happened?

Change the query in the PDF extraction part to "Retrieve the details of the sold items along with the amount paid for Bikbear." Here I am trying to fetch the other details from the PDF besides the company name, which is not a huge deal anyway. You will see how terribly Gemini performs at extracting those details. It basically does nothing.

Relevant log output

No response


koverholt commented 4 weeks ago

Thanks for the feedback on this sample notebook. As a reminder, we aim to keep a friendly, welcoming, and constructive community and environment here per our Code of Conduct.

Multimodal Function Calling is a different use case than sending a prompt with an image or PDF to Gemini and asking it to extract details or text from the document. If you need that functionality, a simpler multimodal request to the Gemini API will work great. You could also add Controlled Generation to get the contents of documents as a structured data object, without needing to use Gemini Function Calling at all.
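As a rough illustration of the Controlled Generation route, here is a minimal sketch using the Vertex AI Python SDK. The schema fields (`company_name`, `items`, `amount_paid`), the helper names, and the model name are assumptions made for this example and are not taken from the notebook. It also assumes `vertexai.init(...)` has already been called for your project.

```python
import json

# Hypothetical schema for the fields asked about in this issue.
# Field names are illustrative, not taken from the notebook.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description"],
            },
        },
        "amount_paid": {"type": "number"},
    },
    "required": ["company_name", "items", "amount_paid"],
}


def parse_invoice(json_text: str) -> dict:
    """Parse the model's JSON reply and check the top-level required fields."""
    data = json.loads(json_text)
    missing = [key for key in INVOICE_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"response is missing fields: {missing}")
    return data


def extract_invoice(pdf_uri: str, model_name: str = "gemini-1.5-pro") -> dict:
    """Send a PDF plus a prompt and ask for JSON that follows INVOICE_SCHEMA.

    The model name is an assumption; use whichever Gemini model you have access to.
    """
    # Imported lazily so the schema and parser can be used without the SDK installed.
    from vertexai.generative_models import GenerationConfig, GenerativeModel, Part

    model = GenerativeModel(model_name)
    response = model.generate_content(
        [
            Part.from_uri(pdf_uri, mime_type="application/pdf"),
            "Retrieve the details of the sold items along with the amount paid.",
        ],
        generation_config=GenerationConfig(
            response_mime_type="application/json",
            response_schema=INVOICE_SCHEMA,
        ),
    )
    return parse_invoice(response.text)
```

With this approach no FunctionDeclaration or tool is involved: the schema only constrains the shape of the text that Gemini returns.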

Multimodal Function Calling goes further: it is meant for cases where you need to take action based on the results of a multimodal request. This involves defining a JSON schema for your function (a FunctionDeclaration), wrapping those FunctionDeclarations in a tool, then using Function Calling as you normally would: get the predicted functions and parameters, call an external API or function, then return the results to Gemini.
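The round trip described above could look roughly like the sketch below. The function `lookup_company`, its canned return value, the `dispatch` helper, and the model name are hypothetical stand-ins rather than the notebook's code, and the sketch assumes `vertexai.init(...)` has already been called.

```python
def lookup_company(company_name: str) -> dict:
    """Stand-in for an external API call; returns canned data for illustration."""
    return {"company_name": company_name, "status": "found"}


# Maps the function names Gemini may predict to local Python callables.
LOCAL_FUNCTIONS = {"lookup_company": lookup_company}


def dispatch(name: str, args: dict) -> dict:
    """Run the local function that matches a predicted function call."""
    if name not in LOCAL_FUNCTIONS:
        raise KeyError(f"no local function registered for {name!r}")
    return LOCAL_FUNCTIONS[name](**args)


def run_round_trip(pdf_uri: str, prompt: str, model_name: str = "gemini-1.5-pro") -> str:
    """Declare a function, let Gemini predict a call, run it, and return the result."""
    # Imported lazily so the dispatcher above can be used without the SDK installed.
    from vertexai.generative_models import (
        FunctionDeclaration,
        GenerativeModel,
        Part,
        Tool,
    )

    declaration = FunctionDeclaration(
        name="lookup_company",
        description="Look up a company mentioned in a document",
        parameters={
            "type": "object",
            "properties": {"company_name": {"type": "string"}},
            "required": ["company_name"],
        },
    )
    tool = Tool(function_declarations=[declaration])
    model = GenerativeModel(model_name, tools=[tool])
    chat = model.start_chat()

    # Step 1: send the PDF and the prompt; Gemini may answer with a function call.
    response = chat.send_message(
        [Part.from_uri(pdf_uri, mime_type="application/pdf"), prompt]
    )
    calls = response.candidates[0].function_calls
    if not calls:
        return response.text  # Gemini answered directly without calling a function.

    # Step 2: run the predicted function locally with the predicted parameters.
    call = calls[0]
    result = dispatch(call.name, dict(call.args))

    # Step 3: return the function result to Gemini for the final answer.
    final = chat.send_message(
        Part.from_function_response(name=call.name, response={"content": result})
    )
    return final.text
```

Gemini only ever fills in the parameters declared in the schema, so a declaration with a single `company_name` property yields only a company name, whatever the prompt asks for.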

So, in order to modify the PDF example in https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb, you would need to modify the JSON schema to specify the exact data structure that you want to output, modify the files and/or prompts as needed, then handle the function name and parameters to make an external API or function call.
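To make those steps concrete, here is a sketch of what an extended parameter schema and its handler might look like for the query in this issue. The function name `record_sale`, the field names, and the total check are all assumptions chosen for illustration. The handler takes plain Python dicts, so arguments coming back from the SDK may need converting first.

```python
# Hypothetical parameter schema that asks for sold items and the amount paid,
# in addition to the company name. It would be passed as `parameters` to a
# FunctionDeclaration named "record_sale".
SALE_PARAMETERS = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string", "description": "Name of the seller"},
        "items": {
            "type": "array",
            "description": "Line items sold",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                    "unit_price": {"type": "number"},
                },
                "required": ["description", "unit_price"],
            },
        },
        "amount_paid": {"type": "number", "description": "Total amount paid"},
    },
    "required": ["company_name", "items", "amount_paid"],
}


def record_sale(company_name: str, items: list, amount_paid: float) -> dict:
    """Stand-in for the external call: summarize the sale and check the total."""
    # A missing quantity is treated as 1 (an assumption of this sketch).
    line_total = sum(item.get("quantity", 1) * item["unit_price"] for item in items)
    return {
        "company_name": company_name,
        "item_count": len(items),
        "line_total": round(line_total, 2),
        "amount_paid": amount_paid,
        "matches": abs(line_total - amount_paid) < 0.01,
    }


def handle_function_call(name: str, args: dict) -> dict:
    """Check the predicted function name and parameters, then call the handler."""
    if name != "record_sale":
        raise ValueError(f"unexpected function name: {name!r}")
    missing = [key for key in SALE_PARAMETERS["required"] if key not in args]
    if missing:
        raise ValueError(f"function call is missing parameters: {missing}")
    return record_sale(**{key: args[key] for key in SALE_PARAMETERS["required"]})
```

The dict returned by `handle_function_call` is what you would send back to Gemini as the function response.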

It's hard to say without seeing the full inputs and outputs that you used, but it might be the case that you didn't update the JSON schema in the FunctionDeclaration and are only seeing the company name in the output, as defined in the current FunctionDeclaration. In summary, consider using multimodal calls to the Gemini API or Controlled Generation if you just want to extract details from documents. Or, if you need that and also want to implement Function Calling on top of it, Multimodal Function Calling might be a good fit!