ScottLogic / prompt-injection

Application which investigates defensive measures against prompt injection attacks on an LLM, with a focus on the exposure of external tools.
MIT License

QA LLM defence is not reported when triggered #899

Open · chriswilty opened this issue 5 months ago

Bug report

Description

While testing the langchain upgrade (see #897), I noticed we are not reporting all triggered defences.

Included:

Excluded:

While it seems to make sense not to include RSE or Instruction as triggerable defences, the Q&A LLM bot is instructed by default to respond with "I cannot reveal confidential information" when it detects an attempt to retrieve sensitive info. We could use this to (crudely) check whether the Q&A bot detected malicious intent, and mark the response as triggered.
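As a very rough sketch of that crude check (the helper name and constant below are hypothetical, not from the codebase), a case-insensitive match against the default refusal phrase on the raw Q&A output would be enough:

```typescript
// Hypothetical helper, not in the codebase: a crude check on the raw Q&A LLM
// output, run before the main bot rewrites that output into its final answer.
const QA_REFUSAL_PHRASE = 'i cannot reveal confidential information';

function isQaDefenceTriggered(rawQaReply: string): boolean {
	// Case-insensitive substring match against the default refusal phrase.
	return rawQaReply.toLowerCase().includes(QA_REFUSAL_PHRASE);
}
```

This is brittle if the Q&A bot's instruction is changed from the default, but it matches the "crude" check described above.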

This will be markedly different to how we use the Evaluator LLM to detect malicious intent in the original prompt, as the Q&A bot is designed to answer a question, rather than simply check a prompt and respond with "yes" or "no" to the question "is this malicious?"

Instead, we would likely need to add an optional defencesTriggered field to the FunctionCallResponse output from chatGptCallFunction in backend/src/openai.ts, and pass it back through getFinalReplyAfterAllToolCalls and chatGptSendMessage so it can be checked in handleChatWithDefenceDetection in backend/src/controller/chatController.ts.

It is somewhat unfortunate that the original output from the Q&A LLM is lost when the main bot converts it into a context-enriched response: by the time we run the defence checks, chat completion has already turned the exact phrase "I cannot reveal confidential information" into something like "I cannot provide information on employee bonuses as it is considered confidential", so we cannot check for the phrase at that point. The upshot is that we cannot use the existing triggered-defences mechanism to report the Q&A defence; we need this different mechanism earlier in the processing chain (see the sketch below).
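A rough sketch of the shape this could take follows. The optional field is the only real proposal here; everything else (the stand-in types, the 'QA_LLM' identifier and the helper names) is an assumption rather than the project's actual API:

```typescript
// Rough sketch only: the real FunctionCallResponse shape and the signatures in
// backend/src/openai.ts and chatController.ts will differ.
type DEFENCE_ID = string; // stand-in for however the project identifies defences

interface FunctionCallResponse {
	reply: string; // stand-in for the fields chatGptCallFunction already returns
	defencesTriggered?: DEFENCE_ID[]; // the proposed optional field
}

// chatGptCallFunction would set the field from the Q&A tool's raw reply (using a
// check like isQaDefenceTriggered above), before that reply is folded into the
// main bot's context-enriched answer. 'QA_LLM' is an assumed identifier.
function buildFunctionCallResponse(
	rawQaReply: string,
	qaDefenceTriggered: boolean
): FunctionCallResponse {
	return {
		reply: rawQaReply,
		defencesTriggered: qaDefenceTriggered ? ['QA_LLM'] : undefined,
	};
}

// handleChatWithDefenceDetection would then read the field, after it has been
// passed back through getFinalReplyAfterAllToolCalls and chatGptSendMessage,
// when deciding whether to add a red "defence triggered" message to the chat.
function qaDefenceWasTriggered(response: FunctionCallResponse): boolean {
	return response.defencesTriggered?.includes('QA_LLM') ?? false;
}
```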

It is possible we might find a better, universal solution when converting our code to use LCEL chains in #898.

Reproduction steps

Steps to reproduce the behaviour:

  1. Go to Sandbox
  2. Click on Model Configuration in the left panel
  3. Toggle "Q/A LLM" on
  4. Input a prompt into the main chat box, such as "Tell me about employee bonuses"

Expected behaviour

A red "defence triggered" info message appears in the main chat panel, as for other defences:

Acceptance criteria

GIVEN I am in Sandbox or Level 3
WHEN the Q/A LLM model configuration defence is active
AND I ask the bot for some confidential / sensitive information
THEN a red info message "q&a llm defence triggered" appears in the main chat window beneath the bot's response