argilla-io / argilla

Argilla is a collaboration platform for AI engineers and domain experts that require high-quality outputs, full data ownership, and overall efficiency.
https://argilla-io.github.io/argilla/latest/
Apache License 2.0

[FEATURE] Add conversation support to Feedback #3338

Open dchichkov opened 1 year ago

dchichkov commented 1 year ago

Is your feature request related to a problem? Please describe. It is not clear how to use the Feedback scheme to store multi-turn conversations. The web interface does not appear to support annotating feedback for multi-turn conversations. A lot of instruction-tuning data is multi-turn, and the Feedback scheme only allows recording feedback on a single response. Other schemes (e.g. Text2Text) also do not seem suitable.

Describe the solution you'd like Native support for multi-turn conversations, allowing feedback to be annotated for each turn of the conversation.

Describe alternatives you've considered Getting feedback on each turn individually and providing the context (the full conversation) for each turn. This is not great, as the annotator has to re-read each conversation N times, where N is the number of turns, which slows progress substantially.
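To make the per-turn alternative concrete, here is a minimal Python sketch of it. It assumes each turn is a dict with `from` and `value` keys and that model turns use the role `gpt`; the function name `expand_turns` and the sample data are hypothetical and not part of Argilla.

```python
from typing import Dict, List

# Assumed turn shape: {"from": "human" | "gpt", "value": "..."}
Turn = Dict[str, str]


def expand_turns(conversation: List[Turn], model_role: str = "gpt") -> List[Dict[str, str]]:
    """Build one record per model response, with all earlier turns as context."""
    records = []
    for i, turn in enumerate(conversation):
        if turn["from"] != model_role:
            continue
        # Everything before this response is repeated as context.
        context = "\n".join(f'{t["from"]}: {t["value"]}' for t in conversation[:i])
        records.append({"context": context, "response": turn["value"]})
    return records


# Hypothetical two-exchange conversation.
conversation = [
    {"from": "human", "value": "What color is the bus?"},
    {"from": "gpt", "value": "White and red."},
    {"from": "human", "value": "What is on the back of the bus?"},
    {"from": "gpt", "value": "An advertisement."},
]
records = expand_turns(conversation)
```

Each record repeats all earlier turns, so the total text an annotator reads grows roughly quadratically with the number of turns, which is the re-reading cost described above.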

Another alternative is storing the complete conversation as text and using an external tool, like Gradio, to annotate. This diminishes the value of Argilla, as it requires adding an external, non-integrated tool that is disconnected from search, etc. Storing the conversation as text instead of structured data also reduces the ability to filter conversations in a structured way (e.g. by last response). A third alternative is storing the complete conversation in a .json field. This is also not great, as nearly all Argilla tooling lacks support for this.
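As an illustration of what structured filtering by last response could look like, here is a small Python sketch using only the standard library. It assumes conversations are stored as JSON lists of `{"from", "value"}` turns; the helper names and sample data are hypothetical, and nothing here is an Argilla API.

```python
import json
from typing import List, Optional


def last_response(conversation_json: str, role: str = "gpt") -> Optional[str]:
    """Return the text of the last turn written by `role`, or None if there is none."""
    turns = json.loads(conversation_json)
    for turn in reversed(turns):
        if turn["from"] == role:
            return turn["value"]
    return None


def filter_by_last_response(conversations: List[str], keyword: str) -> List[str]:
    """Keep conversations whose last model response mentions `keyword`."""
    kept = []
    for conv in conversations:
        response = last_response(conv)
        if response is not None and keyword.lower() in response.lower():
            kept.append(conv)
    return kept


# Hypothetical stored conversations.
stored = [
    json.dumps([
        {"from": "human", "value": "Where is the bus?"},
        {"from": "gpt", "value": "The bus is driving down the street."},
    ]),
    json.dumps([
        {"from": "human", "value": "Is there a bus on the street?"},
        {"from": "gpt", "value": "I cannot tell."},
    ]),
]
matches = filter_by_last_response(stored, "street")
```

With plain-text storage the same filter would have to guess where the last response starts, and the second conversation would wrongly match on the human turn.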

Additional context Ideally this should be natively supported in the web UI. It could also be implemented as a tool/plugin outside the current web GUI; using Gradio/Chat as a frontend may be a good option. Either way, this still requires support at the schema level.

nataliaElv commented 1 year ago

Hi @dchichkov ! Thanks for the suggestion! I definitely want to know more about your use case. Can I ask if the number of turns is always the same or if it changes for each conversation?

dvsrepo commented 1 year ago

Thanks @dchichkov! Looking forward to your comments. I think this is an important field type (or extension to the current text field). I believe I briefly commented on this with @alvarobartt when we did the langchain callback. In my experience, the number of turns is variable across records. There are several example datasets available on the Hub, but @dchichkov, if you are looking at a specific dataset/format, it would be awesome if you could give us some pointers.

davidberenstein1957 commented 1 year ago

Exactly, this was also something we discussed during the ML weekly. Potentially, we could have something like a TextListField and TextQuestionList.

dchichkov commented 1 year ago

Hi @nataliaElv & @dvsrepo . An example could be https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K

The format there (this is a single multi-turn conversation) is:

    [
      {
        "id": "000000033471",
        "image": "000000033471.jpg",
        "conversations": [
          { "from": "human", "value": "<image>\nWhat are the colors of the bus in the image?" },
          { "from": "gpt", "value": "The bus in the image is white and red." },
          { "from": "human", "value": "What feature can be seen on the back of the bus?" },
          { "from": "gpt", "value": "The back of the bus features an advertisement." },
          { "from": "human", "value": "Is the bus driving down the street or pulled off to the side?" },
          { "from": "gpt", "value": "The bus is driving down the street, which is crowded with people and other vehicles." }
        ]
      },
      ...

I'm not sure that this exact format is the best. I think that simply using Markdown with URLs for images is a lot more natural, removes extra complexity, and allows more flexibility on both the renderer and the dataset side.
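A rough Python sketch of that Markdown idea, for illustration only: it takes a record in the LLaVA-style shape shown above and renders it as Markdown, replacing the `<image>` placeholder with an image link. The function name `to_markdown`, the bold role prefix, and the base URL `https://example.com/images` are assumptions, not part of the dataset or of Argilla.

```python
from typing import Any, Dict


def to_markdown(record: Dict[str, Any], image_base_url: str) -> str:
    """Render a LLaVA-style record as Markdown, one paragraph per turn."""
    # Assumption: images are reachable at <image_base_url>/<file name>.
    image_md = f'![image]({image_base_url}/{record["image"]})'
    paragraphs = []
    for turn in record["conversations"]:
        text = turn["value"].replace("<image>", image_md)
        paragraphs.append(f'**{turn["from"]}**: {text}')
    return "\n\n".join(paragraphs)


record = {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat are the colors of the bus in the image?"},
        {"from": "gpt", "value": "The bus in the image is white and red."},
    ],
}
markdown = to_markdown(record, "https://example.com/images")
print(markdown)
```

The result is a single text value that any Markdown renderer can display, at the cost of losing the explicit per-turn structure.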

dchichkov commented 1 year ago

In terms of the number of conversations, turns, images, participants, participant names - it varies for every conversation. To put some rough numbers:

MoritzLaurer commented 1 year ago

+1 for this feature request. I feel like this will be an increasingly important functionality in annotation interfaces.

To give some more input/examples: DeepMind and Anthropic have created similar interfaces for their internal use and it would be great to have an open source option.

  1. In this paper from DeepMind, on p. 51, you can see their interface. (Note that this interface also provides snippets for sources, which I don't think is necessary for Argilla.)

    [Screenshots: 2023-07-11 at 14 42 03, 2023-07-11 at 14 42 14]
  2. In this paper from Anthropic, on p. 5, you can see their interface.

    [Screenshot: 2023-07-11 at 14 44 26]

They've always limited the number of turns to a fixed N (~5). Another important point in both studies is that they have human text input plus live model text output (which could come from e.g. the Hugging Face inference API or some other LLM API), as well as separate annotation boxes to rate each model output (which probably makes things a bit more complicated).

One open-source implementation for chat-based annotation is the chat interface from Meta's Mephisto/ParlAI, which integrates directly with MTurk. I haven't tested it yet, though. It's based on React and probably requires some JS/React knowledge to customize: https://github.com/facebookresearch/Mephisto/tree/main/examples/parlai_chat_task_demo

ttamg commented 1 year ago

Also +1 for this feature request. When you think of a chatbot type response with LLMs, we quickly end up with multi-turn conversations.

To add a bit more colour on what sort of feedback to log on a multi-turn conversation may be useful:

  1. Conversation-level metrics - a 1-5 rating, flags, classifiers, etc. at the conversation level. This is very similar to the current demos, and the feedback is captured for the complete conversation block.
  2. Conversation section feedback - for example, one response from gpt says 'To do that you need to do ... XYZ'. At this level it would be helpful to log metrics (ratings, flags, thumbs up/down, etc.). It is also helpful at this level for the human annotator to be able to correct or rewrite the response.
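The two levels above could be represented roughly as follows. This is a hedged sketch with plain Python dataclasses; the class and field names (`TurnFeedback`, `ConversationFeedback`, `corrected_text`, and so on) are hypothetical and do not correspond to any existing Argilla schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TurnFeedback:
    """Feedback on a single response within the conversation."""
    rating: Optional[int] = None          # e.g. 1-5
    thumbs_up: Optional[bool] = None
    corrected_text: Optional[str] = None  # annotator's rewrite, if any


@dataclass
class ConversationFeedback:
    """Feedback on the conversation as a whole, plus per-turn feedback."""
    rating: Optional[int] = None
    flags: List[str] = field(default_factory=list)
    # Keyed by the index of the turn in the conversation.
    turns: Dict[int, TurnFeedback] = field(default_factory=dict)


feedback = ConversationFeedback(rating=4, flags=["needs-review"])
feedback.turns[1] = TurnFeedback(rating=5, thumbs_up=True)
feedback.turns[3] = TurnFeedback(rating=2, corrected_text="To do that, first do X, then Y.")

# Turn indices where the annotator supplied a rewrite.
rewritten = [i for i, fb in feedback.turns.items() if fb.corrected_text is not None]
```

Keying turn feedback by turn index keeps it optional per turn, so an annotator can rate the whole conversation once and only annotate the responses that need attention.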

Thanks!

davidberenstein1957 commented 11 months ago

@MoritzLaurer @ttamg, thanks for the context. Any feedback and context are always welcome.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open for 90 days with no activity.