RMNCLDYO / gemini-ai-toolkit

Unlock the potential of Google's Gemini AI models with this versatile toolkit. It offers seamless chat, text generation, and multimodal interactions, and supports various file types, including PDFs, images, video, audio, text, and more. Enjoy real-time responses, customizable parameters, and easy integration for diverse AI tasks.
MIT License

Multiple Images in Single API Call #2

Status: Closed (Philomath88 closed this issue 1 month ago)

Philomath88 commented 6 months ago

Is your feature request related to a problem? Please describe.

Yes, I am facing a challenge with uploading images (specifically, pages extracted from a PDF document) to a server or API for processing. My goal is to extract text from these images for further analysis or storage. The current system or tool I'm using does not support batch processing of multiple images in a single request, which leads to inefficiencies and increased processing time.

Describe the solution you'd like

I would like a feature that enables the batch uploading and processing of multiple images in a single API request. This feature should allow me to send a list of images (converted pages from a PDF document) and receive a consolidated response that includes the extracted text from each image. Ideally, the solution would handle varying image formats and sizes, ensuring accurate text extraction. Additionally, having the ability to specify certain parameters for text extraction, such as language or extraction mode (e.g., OCR, structured text extraction), would be highly beneficial.

Describe alternatives you've considered

An alternative solution I've considered involves manually splitting the PDF into individual pages and sending separate requests for each page. However, this approach is not scalable and increases the complexity of handling responses and reassembling the text in the correct order. Another alternative is using a third-party service that supports batch processing, but this often comes with higher costs and potential data privacy concerns.

Additional context

In my use case, the ability to efficiently process documents and extract text is crucial for data analysis and entry. The documents I'm dealing with are often scanned pages of text, which necessitates robust OCR capabilities. Enhancing the current system to support batch image processing in a single request would significantly improve our workflow, reduce processing times, and potentially increase accuracy by allowing context to be maintained across pages.

SAHRIAR-ANIK commented 1 month ago

We need support for multiple images at once in Gemini. Please bring it very soon.

RMNCLDYO commented 1 month ago

Hey,

Just wanted to let you both know that the feature you requested, batch uploading and processing of multiple images in a single API request, is now available in v1.3.

While the Gemini model supports text extraction from images, it's not as precise as dedicated OCR tools. You can specify extraction requirements via the prompt or system prompt, but the results may vary. Uploading individual images of each PDF page, as opposed to a single (large) PDF, might help improve accuracy (I have noticed that the models do have a hard time processing PDFs).

The Gemini AI Toolkit now supports multimodal prompting, allowing you to include text, image, video, audio, documents, code, and more in your prompts.
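
For context, at the REST level a multi-image request is one `contents` entry whose `parts` list mixes text and images. Below is a minimal sketch of that payload shape, assuming small images sent inline as base64 rather than through the File API. `build_multi_image_payload` is a hypothetical helper written for illustration and is not the toolkit's interface; see the README for the toolkit's actual usage.

```python
import base64
import json


def build_multi_image_payload(prompt, images):
    """Build one generateContent-style request body holding several images.

    `images` is a list of (mime_type, raw_bytes) tuples, kept in page order
    so the model sees the pages in the same sequence as the document.
    """
    parts = [{"text": prompt}]
    for mime_type, raw in images:
        parts.append({
            "inline_data": {
                "mime_type": mime_type,
                # Inline image data is sent as base64-encoded text.
                "data": base64.b64encode(raw).decode("ascii"),
            }
        })
    return {"contents": [{"parts": parts}]}


if __name__ == "__main__":
    # Placeholder bytes stand in for real image files in this sketch.
    pages = [("image/png", b"page-1-bytes"), ("image/png", b"page-2-bytes")]
    body = build_multi_image_payload(
        "Extract the text from each page, in order.", pages
    )
    print(json.dumps(body, indent=2))
```

Keeping all pages in a single `parts` list is what lets the model use context across pages, which was one of the motivations in the original request.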

See the usage details in the README.

For reference: the File API allows storing up to 20GB per project, with each file up to 2GB. Files are only stored on Google's cloud servers for 48 hours. Google requires files sent through the API to be uploaded to the File API first, which has these limits.
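
If you are batching many page images, it can help to check sizes against these limits before uploading. The following is a rough sketch only: `check_file_api_limits` is a hypothetical helper (not part of the toolkit or of Google's SDK), and it assumes the limits are counted in binary gigabytes, which the figures above do not specify.

```python
GB = 1024 ** 3  # assumption: binary gigabytes; the stated limits do not say which unit
MAX_FILE_BYTES = 2 * GB      # per-file limit quoted above
MAX_PROJECT_BYTES = 20 * GB  # per-project storage limit quoted above


def check_file_api_limits(file_sizes, already_stored=0):
    """Return a list of human-readable problems; an empty list means OK.

    `file_sizes` maps a file name to its size in bytes.
    `already_stored` is the number of bytes the project currently holds.
    """
    problems = []
    for name, size in file_sizes.items():
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds the per-file limit")
    total = already_stored + sum(file_sizes.values())
    if total > MAX_PROJECT_BYTES:
        problems.append(f"total {total} bytes exceeds the per-project limit")
    return problems


if __name__ == "__main__":
    sizes = {"page-001.png": 450_000, "page-002.png": 512_000}
    print(check_file_api_limits(sizes) or "within limits")
```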

I'm going to close this issue now since it's taken care of. If you run into any other issues or have more suggestions, feel free to open a new one.

Thanks for the input!

Cheers