dimagi / open-chat-studio

A web-based platform for building chatbots backed by Large Language Models
BSD 3-Clause "New" or "Revised" License

bug: unsafe content shown when continuing the chat #708

Open bderenzi opened 1 week ago

bderenzi commented 1 week ago

https://chatbots.dimagi.com/a/bmgf-demos/experiments/e/1693/session/16764/ -- you'll have to spoof as me.

Since we're saving unsafe user and bot messages to the transcript now, they show up when the user chooses to "continue chat".

In the original chat, the unsafe message isn't shown. I assume unsafe human messages will also show up when resuming a chat, but I didn't test that.

bderenzi commented 1 week ago

Chris and I discussed this offline and agreed we'll wait until pipelines are ready to replace safety layers before addressing this.

SmittieC commented 1 week ago

Some background

Problem statement

In terms of safety layers, we currently save the offending human and/or AI message to the chat history. This is useful when viewing the transcript. When the AI response is the offending one, we 1. save the message to history, 2. generate a safe response and 3. return this safe response to the caller. This all happens in the process_input method inside TopicBot.
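
A rough sketch of that flow is below. Only `TopicBot` and `process_input` are names from the actual codebase; the helper functions are placeholders for the real LLM call and safety layer, not project code.

```python
# Rough sketch of the flow described above. The helpers are placeholders
# for the real LLM call and safety layer, not actual project code.

def generate_ai_response(user_message: str) -> str:
    return f"(model reply to: {user_message})"


def is_unsafe(message: str) -> bool:
    return "forbidden" in message  # stand-in for the real safety check


def generate_safe_response() -> str:
    return "Sorry, I can't help with that."


class TopicBot:
    def __init__(self):
        self.history: list[str] = []

    def process_input(self, user_message: str) -> str:
        ai_message = generate_ai_response(user_message)
        if is_unsafe(ai_message):
            # 1. Save the offending AI message to the chat history,
            self.history.append(ai_message)
            # 2. generate a safe response,
            safe_message = generate_safe_response()
            # 3. and return only the safe response to the caller.
            return safe_message
        self.history.append(ai_message)
        return ai_message
```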

Two things to focus on:

  1. In this current approach, the user will never see the unsafe AI message unless they refresh the chat. This is sub-optimal, buggy and probably needs fixing.
  2. We cannot show the unsafe AI message in the chat UI - even in debug mode - unless the chat is refreshed, since only one message is returned from the bot.

To solve this we need 1. a way to distinguish offending messages from non-offending ones so we can filter them out whenever needed, and 2. since we also want to show the offending message in the UI, some way to get hold of that message.

The solution

The solution is multi-faceted.

  1. We tag the offending messages with a new tag and use this to filter out these messages when needed. This needs to be done in any case, so I created a separate issue for this.
  2. The task that gets the AI response should return an array of messages with information about each (like whether or not it is unsafe) that we then use to display to the user (see the sketch after this list).
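
A minimal sketch of what those two points could look like; the tag name, the message shape, and the task's return format are illustrative assumptions, not existing code:

```python
# Sketch of points 1 and 2. UNSAFE_TAG, ChatMessage and the task's return
# format are assumptions made for illustration only.
from dataclasses import dataclass, field

UNSAFE_TAG = "unsafe"  # hypothetical tag applied to offending messages


@dataclass
class ChatMessage:
    content: str
    tags: list[str] = field(default_factory=list)

    @property
    def is_unsafe(self) -> bool:
        return UNSAFE_TAG in self.tags


def visible_history(history: list[ChatMessage]) -> list[ChatMessage]:
    """Point 1: filter tagged (offending) messages out, e.g. when continuing a chat."""
    return [message for message in history if not message.is_unsafe]


def task_response(new_messages: list[ChatMessage]) -> list[dict]:
    """Point 2: return every new message with per-message info for the UI."""
    return [{"content": m.content, "is_unsafe": m.is_unsafe} for m in new_messages]
```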

The points above are relatively easy and quick to implement. How we get that array of messages from the bot is where things start to get tricky. We have a few options:

The proper way (LOE: High)

The process_input method should return an object instead of just a single string. This object would be the means to communicate any other data besides the raw AI response back to the caller. Additionally, it would be great if the runnables did the same so that they can also return various other data back to the TopicBot that calls them. This "ability" might be useful in the future for other things as well, but...YAGNI.

This approach will require us to update many areas of the code that expect a single string response to expect an object instead. Thus, the LOE is high for this one.
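
For illustration, "return an object instead of a string" could mean something along these lines; the class and field names are assumptions, not a proposal for the final shape:

```python
# Hypothetical shape for the object that process_input (and possibly the
# runnables) could return instead of a bare string. Names are assumptions.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BotResponse:
    content: str                           # the safe response shown to the user
    unsafe_content: Optional[str] = None   # the original offending message, if any
    metadata: dict = field(default_factory=dict)

    def __str__(self) -> str:
        # Lets callers that only need text use str(response), although every
        # call site that expects a plain string would still need to be reviewed.
        return self.content
```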

A less hacky way (LOE: Low)

Note: this approach doesn't require the task to return an array of messages, so point 2 above wouldn't be needed.

The safe AI message should have in its metadata the ID of the message that it is the safe version of. This way the task doesn't have to return an array of messages, only the actual AI message ID. We can then use this to find the unsafe message as well if we need to.
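
A sketch of that linkage, with an assumed metadata key name and a simplified message shape:

```python
# Sketch: the safe AI message records the ID of the unsafe message it replaced.
# The "safe_replacement_for" key and ChatMessage shape are assumptions.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ChatMessage:
    id: int
    content: str
    metadata: dict = field(default_factory=dict)


def make_safe_replacement(unsafe: ChatMessage, safe_content: str, new_id: int) -> ChatMessage:
    return ChatMessage(
        id=new_id,
        content=safe_content,
        metadata={"safe_replacement_for": unsafe.id},
    )


def find_unsafe_version(safe: ChatMessage, history: list[ChatMessage]) -> Optional[ChatMessage]:
    unsafe_id = safe.metadata.get("safe_replacement_for")
    return next((m for m in history if m.id == unsafe_id), None)
```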

The more hacky way (LOE: Low)

In this approach, we save the unsafe message as an attribute on the TopicBot class. The process_input method doesn't have to return an array of messages, so there's no need to update the rest of the code either. Instead, when building the task's response, we simply fetch the unsafe message from the TopicBot instance that was used to generate the bot's response.
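
Roughly, with placeholder logic standing in for the real LLM and safety calls, and with an assumed attribute name:

```python
# Sketch of the attribute-based approach; the safety check and model call are
# placeholders, and last_unsafe_message is an assumed attribute name.
from typing import Optional


class TopicBot:
    def __init__(self):
        self.last_unsafe_message: Optional[str] = None

    def process_input(self, user_message: str) -> str:
        ai_message = f"(model reply to: {user_message})"  # stand-in for the LLM call
        if "forbidden" in ai_message:                      # stand-in for the safety check
            # Keep the offending message on the instance instead of returning it.
            self.last_unsafe_message = ai_message
            return "Sorry, I can't help with that."
        return ai_message


# The task can then read it back after generating the response:
# bot = TopicBot()
# safe_reply = bot.process_input(user_message)
# unsafe_reply = bot.last_unsafe_message  # None if nothing was flagged
```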

@bderenzi Not sure if this changes anything regarding the decision to wait. cc @snopoke