deepsense-ai / ragbits

Building blocks for rapid development of GenAI applications
https://ragbits.deepsense.ai
MIT License

feat: few shots should support image input #155

Open · kdziedzic68 opened 1 month ago

kdziedzic68 commented 1 month ago

Feature description

Action items:

  - FewShotExample type should be extended with optional field representing list of input images
  - usage of list_few_shots should be moved to LLM method: _format_chat_for_llm because of deciding whether the given model supports vision

Motivation

Users may need to create few-shot learning systems for image processing, e.g. classification.

Additional context

No response

ludwiktrammer commented 1 month ago

I believe "FewShotExample type should be extended with optional field representing list of input images" is not necessary. The FewShotExample object already contains the input model object, which in turn contains the images (for prompts with images).

So we only need to make sure that the existing API for providing few-shot examples works as expected:

prompt.add_few_shot(SongData(name="Alice", age=30, theme="pop", cover_image=image_data), "It's a really catchy tune.")

Currently cover_image will be ignored. It should be used and added to the conversation as an image (provided that it is present in image_input_fields).
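
To make the intended behavior concrete, here is a minimal sketch of how a few-shot example with an image field could be serialized into OpenAI-format chat messages. It assumes pydantic v2 and OpenAI's image_url content parts; few_shot_to_messages and the exact field handling are illustrative, not the actual ragbits implementation:

```python
# Hypothetical sketch, not the actual ragbits internals.
import base64

from pydantic import BaseModel


class SongData(BaseModel):
    name: str
    age: int
    theme: str
    cover_image: bytes


def few_shot_to_messages(
    example_input: SongData,
    example_output: str,
    image_input_fields: list[str],
) -> list[dict]:
    """Turn one few-shot example into OpenAI-format chat messages,
    attaching any image fields as image_url content parts."""
    # Text part: the input model without the raw image bytes.
    text = example_input.model_dump_json(exclude=set(image_input_fields))
    content: list[dict] = [{"type": "text", "text": text}]
    for field_name in image_input_fields:
        image_bytes = getattr(example_input, field_name, None)
        if image_bytes:
            encoded = base64.b64encode(image_bytes).decode()
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{encoded}"},
            })
    return [
        {"role": "user", "content": content},
        {"role": "assistant", "content": example_output},
    ]
```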

ludwiktrammer commented 1 month ago

"usage of list_few_shots should be moved to the LLM method _format_chat_for_llm, because that is where we decide whether the given model supports vision"

This is complicated. I think it would be good to discuss during grooming what kind of data the Prompt's chat() method should return:

  1. The full conversation in the OpenAI format, including images and other non-standard elements.
  2. A list of messages (objects of specific data classes) that is independent of the OpenAI format and specifies the different elements of each message separately. It would be the LLM's role to convert this into the format needed by the LLM model and to decide which elements to use (a sketch of this option follows the list).
  3. Only the textual part of the conversation. It would be the LLM's role to obtain the other elements (like images) by calling the prompt's methods separately and to integrate them with the textual conversation.
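
To make option 2 concrete, a rough sketch of what provider-independent message objects might look like. All names here are hypothetical, invented for illustration; none of these classes exist in ragbits:

```python
# Hypothetical sketch of option 2: provider-independent message objects.
import base64
from dataclasses import dataclass, field


@dataclass
class ImageAttachment:
    data: bytes
    mime_type: str = "image/jpeg"


@dataclass
class ChatMessage:
    role: str  # "system", "user" or "assistant"
    text: str
    images: list[ImageAttachment] = field(default_factory=list)


def to_openai_format(messages: list[ChatMessage], supports_vision: bool) -> list[dict]:
    """Conversion the LLM class would own: render the neutral messages
    in the provider's format, dropping images if the model lacks vision."""
    rendered: list[dict] = []
    for msg in messages:
        if msg.images and supports_vision:
            content: list[dict] = [{"type": "text", "text": msg.text}]
            for img in msg.images:
                encoded = base64.b64encode(img.data).decode()
                content.append({
                    "type": "image_url",
                    "image_url": {"url": f"data:{img.mime_type};base64,{encoded}"},
                })
            rendered.append({"role": msg.role, "content": content})
        else:
            rendered.append({"role": msg.role, "content": msg.text})
    return rendered
```

Under this split, the Prompt stays provider-agnostic and each LLM class decides how (and whether) to render images for its model.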

At the beginning of the project we debated between options 1 and 2 and decided to go with 1. Adding images exposed some disadvantages of option 1 (the prompt alone cannot know what a particular LLM model can handle).

Currently (with the latest PR adding images to prompts, and with how this ticket is written) we seem to be going the route of option 3. I'm not convinced it's the best route: it seems quite wobbly (for example, knowing which image to attach to which element of the conversation). I think it would be worth revisiting our previous discussion as a team.

ludwiktrammer commented 1 week ago

@mhordynski I believe you wanted to read through the comments here and decide on one of the options.