bderenzi opened 1 week ago
Chris and I offlined on this and agreed we'll wait until pipelines are ready to replace safety layers before addressing this.
Some background
In terms of safety layers, we currently save the offending human and/or AI message to the chat history. This is useful when viewing the transcript. When the AI response is the offending one, we 1. save the message to history, 2. generate a safe response, and 3. return this safe response to the caller. This all happens in the `process_input` method inside `TopicBot`.
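For context, the flow described above can be sketched roughly like this. This is a hypothetical illustration, not the actual implementation; every name except `TopicBot` and `process_input` is an assumption, and the generation/safety calls are stand-ins:

```python
# Hypothetical sketch of the current safety-layer flow.
# Only TopicBot and process_input are real names from the codebase.

class TopicBot:
    def __init__(self):
        self.history = []  # chat transcript: list of message dicts

    def process_input(self, user_message: str) -> str:
        ai_message = self._generate_response(user_message)
        if self._is_unsafe(ai_message):
            # 1. save the offending message to history
            self.history.append({"role": "ai", "content": ai_message, "unsafe": True})
            # 2. generate a safe response
            safe_message = self._generate_safe_response()
            self.history.append({"role": "ai", "content": safe_message})
            # 3. return the safe response to the caller
            return safe_message
        self.history.append({"role": "ai", "content": ai_message})
        return ai_message

    def _generate_response(self, user_message: str) -> str:
        return f"echo: {user_message}"  # stand-in for the real LLM call

    def _is_unsafe(self, message: str) -> bool:
        return "unsafe" in message  # stand-in for the real safety check

    def _generate_safe_response(self) -> str:
        return "Sorry, I can't help with that."
```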
Two things to focus on:
To solve this we need two things: 1. a way to distinguish offending messages from non-offending ones so they can be filtered out whenever needed, and 2. a way to get hold of the offending message so it can be shown in the UI.
The solution is multi-faceted.
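For requirement 1, one possible shape is a metadata flag on each stored message, which makes filtering trivial. The field names below (`metadata`, `is_unsafe`) are assumptions, not the actual schema:

```python
# Possible shape for requirement 1: tag offending messages with a metadata
# flag so the transcript can be filtered on demand. Field names are assumed.

def filter_safe(history: list) -> list:
    """Return only messages not flagged as offending."""
    return [m for m in history if not m.get("metadata", {}).get("is_unsafe", False)]

history = [
    {"role": "human", "content": "hi"},
    {"role": "ai", "content": "offending reply", "metadata": {"is_unsafe": True}},
    {"role": "ai", "content": "safe reply"},
]
```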
The points above are relatively easy and quick to implement. How we get that array of messages back from the bot is where things start to get tricky. We have a few options:
The `process_input` method should return an object instead of just a single string. This object would be the means of communicating any data besides the raw AI response back to the caller. Ideally the runnables would do the same, so that they can also return various other data back to the `TopicBot` that calls them. This ability might be useful for other things in the future, but...YAGNI.
This approach will require us to update many areas of the code that currently expect a single string response to expect an object instead, so the LOE is high for this one.
Note: this approach doesn't require the task to return an array of messages, so number 2 above wouldn't be needed.
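A minimal sketch of the "return an object" option, assuming a hypothetical `ChatResponse` dataclass (the name and fields are illustrative, not decided):

```python
# Sketch of returning an object from process_input instead of a bare string.
# ChatResponse is a hypothetical name; the real object could carry whatever
# extra data the runnables need to pass back.

from dataclasses import dataclass, field

@dataclass
class ChatResponse:
    content: str  # the raw AI response, as before
    unsafe_messages: list = field(default_factory=list)  # offending messages, if any

    def __str__(self) -> str:
        # Lets call sites that only need the text use str(response),
        # though every caller would still need review (hence the high LOE).
        return self.content
```

Usage would look like `response = bot.process_input(msg)` followed by reading `response.content` and `response.unsafe_messages`.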
The safe AI message should have in its metadata the ID of the message that it is the safe version of. This way we don't have to return an array of messages, only the actual AI message ID. We can then use this to find the unsafe message if needed.
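Sketched out, the metadata link might look like this. The IDs and the `safe_version_of` field name are assumptions for illustration only:

```python
# Sketch of linking the safe AI message to the unsafe original via metadata.
# Message IDs and field names are assumed, not the actual schema.

messages = {
    "msg-1": {"content": "offending reply", "metadata": {"is_unsafe": True}},
    "msg-2": {"content": "safe reply", "metadata": {"safe_version_of": "msg-1"}},
}

def find_unsafe_original(safe_id: str, store: dict):
    """Given a safe message's ID, look up the unsafe message it replaced."""
    unsafe_id = store[safe_id]["metadata"].get("safe_version_of")
    return store.get(unsafe_id) if unsafe_id else None
```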
In this approach, we save the unsafe message as an attribute on the `TopicBot` class. The `process_input` method doesn't have to return an array of messages, so there's no need to update the rest of the code either. Instead, in the task's `response` we simply fetch the unsafe message from the `TopicBot` instance that we used to generate the bot's response.
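A sketch of the attribute approach: `process_input` keeps returning a plain string, and the unsafe message is stashed on the bot instance for the task to pick up afterwards. Apart from `TopicBot` and `process_input`, every name here is an assumption, and the generation/safety logic is a stand-in:

```python
# Sketch of the attribute approach. process_input's signature is unchanged;
# the unsafe message lives on the instance. Names other than TopicBot and
# process_input are assumptions.

class TopicBot:
    def __init__(self):
        self.unsafe_message = None  # populated only when the safety layer fires

    def process_input(self, user_message: str) -> str:
        ai_message = f"echo: {user_message}"  # stand-in for the real pipeline
        if "unsafe" in ai_message:  # stand-in safety check
            self.unsafe_message = ai_message
            return "Sorry, I can't help with that."
        return ai_message

# In the task's response handling we'd then do something like:
bot = TopicBot()
reply = bot.process_input("something unsafe")
if bot.unsafe_message is not None:
    # surface the unsafe message to the UI alongside the safe reply
    pass
```

The trade-off is that the caller must hold onto the same `TopicBot` instance until it has read the attribute, and the attribute should be cleared between turns.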
@bderenzi Not sure if this changes anything regarding the decision to wait. cc @snopoke
https://chatbots.dimagi.com/a/bmgf-demos/experiments/e/1693/session/16764/ -- you'll have to spoof as me.
Since we're now saving unsafe user and bot messages to the transcript, they show up when the user chooses to "continue chat".
In the original chat, the unsafe message doesn't show. I assume unsafe human messages will also show up when resuming a chat, but I didn't test that.