[Bug]: Barely multimodal and useless for the most part in PDF extraction

Thanks for the feedback on this sample notebook. As a reminder, we aim to keep a friendly, welcoming, and constructive community and environment here per our Code of Conduct.

Multimodal Function Calling is a different use case from sending a prompt with an image or PDF to Gemini and asking it to extract details or text from the document. If that is what you need, a simpler multimodal request to the Gemini API is enough. You could also add Controlled Generation to get the contents of documents as a structured data object, without using Gemini Function Calling at all.
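
As a rough illustration of the Controlled Generation route, the sketch below defines a response schema as a plain dictionary and parses the JSON text a model would return under it. The field names (`company_name`, `invoice_total`), the helper `parse_structured_response`, and the sample response are all hypothetical. In the Vertex AI SDK, a schema like this would typically be passed as `response_schema` together with `response_mime_type="application/json"` in the generation config; no API call is made here.

```python
import json

# Hypothetical schema describing the fields to extract from a document.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "invoice_total": {"type": "number"},
    },
    "required": ["company_name", "invoice_total"],
}


def parse_structured_response(text: str, schema: dict) -> dict:
    """Parse model output as JSON and check that required fields are present."""
    data = json.loads(text)
    missing = [key for key in schema["required"] if key not in data]
    if missing:
        raise ValueError(f"Response is missing required fields: {missing}")
    return data


# Stand-in for the JSON text a model might return under this schema.
sample_response = '{"company_name": "Example Corp", "invoice_total": 1250.5}'
extracted = parse_structured_response(sample_response, RESPONSE_SCHEMA)
```

The required-field check is only a light sanity check on the parsed output, not full JSON Schema validation.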

Multimodal Function Calling goes further: it is for cases where you need to take action based on the results of a multimodal request. This involves defining a JSON schema for each function (a FunctionDeclaration), wrapping those FunctionDeclarations in a tool, then using Function Calling as you normally would: get the predicted functions and parameters, call an external API or function, then return the results to Gemini.
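
The "call an external API or function, then return the results" part of that flow can be sketched without any SDK. The example below assumes the predicted function call has already been read out of the model response into a plain dictionary with `name` and `args` keys; the `lookup_company` handler, the `HANDLERS` registry, and the shape of the returned dictionary are hypothetical stand-ins, not the notebook's code.

```python
from typing import Any, Callable


def lookup_company(company_name: str) -> dict[str, Any]:
    """Hypothetical external lookup; a real version would call an API."""
    return {"company_name": company_name, "status": "found"}


# Map function names (as declared to the model) to local handlers.
HANDLERS: dict[str, Callable[..., dict[str, Any]]] = {
    "lookup_company": lookup_company,
}


def dispatch(function_call: dict[str, Any]) -> dict[str, Any]:
    """Run the handler for a predicted function call and package its result."""
    name = function_call["name"]
    args = function_call.get("args", {})
    if name not in HANDLERS:
        raise KeyError(f"No handler registered for function: {name}")
    result = HANDLERS[name](**args)
    # This dictionary is what would be sent back to the model as the
    # function response in the next turn.
    return {"name": name, "response": result}


# Stand-in for a function call predicted by the model.
predicted_call = {"name": "lookup_company", "args": {"company_name": "Example Corp"}}
function_response = dispatch(predicted_call)
```

Keeping the handlers in a registry keyed by function name means adding a new FunctionDeclaration only requires registering one more handler.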

So, in order to modify the PDF example in https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb, you would need to modify the JSON schema to specify the exact data structure that you want to output, modify the files and/or prompts as needed, then handle the function name and parameters to make an external API or function call.
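
To make the schema change concrete, here is a minimal sketch of what an extended function declaration might look like, written as a plain dictionary in the OpenAPI-style format that function declarations use. The function name `extract_document_details` and every field other than the company name are hypothetical; substitute the structure you actually want returned from your PDF.

```python
# Hypothetical declaration; only fields listed under "properties" are
# described to the model as parameters it can fill in.
EXTRACT_FUNCTION = {
    "name": "extract_document_details",
    "description": "Record details extracted from a PDF document",
    "parameters": {
        "type": "object",
        "properties": {
            "company_name": {
                "type": "string",
                "description": "Name of the company in the document",
            },
            "document_date": {
                "type": "string",
                "description": "Date printed on the document",
            },
            "line_items": {
                "type": "array",
                "description": "Items listed in the document",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount": {"type": "number"},
                    },
                },
            },
        },
        "required": ["company_name"],
    },
}


def declared_fields(declaration: dict) -> list[str]:
    """List the top-level parameter names a declaration exposes."""
    return sorted(declaration["parameters"]["properties"])


fields = declared_fields(EXTRACT_FUNCTION)
```

Listing the declared fields is a quick way to confirm that the schema really asks for everything you expect to see in the output.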

It's hard to say without seeing the full inputs and outputs that you used, but it may be that you didn't update the JSON schema in the FunctionDeclaration and are therefore seeing only the company name in the output, as defined in the current FunctionDeclaration. In summary, consider using multimodal calls to the Gemini API or Controlled Generation if you just want to extract details from documents. If you need that and also want to act on the results through Function Calling, Multimodal Function Calling might be a good fit!

[Bug]: Barely multimodal and useless for the most part in PDF extraction #1278
