ScottLogic / prompt-injection

Application which investigates defensive measures against prompt injection attacks on an LLM, with a focus on the exposure of external tools.
MIT License

Refactor: Shift the logic for checking win condition #780

Closed: pmarsh-scottlogic closed this issue 5 months ago

pmarsh-scottlogic commented 7 months ago

Question: do we want this? I think it will make things easier to follow.

This goes along with #761. Do it before or after, but not at the same time.

Description

Summary: move the win-condition logic out from deep within the call stack, to flatten the structure and better respect single responsibility.

Right now, a user sends a message, and to see if that message has caused the level to be completed, we have to look right at the bottom of the call stack: handleChatToGPT(...) => handleLow[orHigher]LevelChat(...) => chatGptSendMessage(...) => getFinalReplyAfterAllToolCalls(...) => performToolCalls(...) => chatGptCallFunction(...) => sendEmail(...) => checkLevelWinCondition(...).

I think we should separate the logic for checking the win condition from the logic that deals with processing the user's message and getting a reply. Something like

function handleChatToGPT(...) {
  const reply = handleLow[orHigher]LevelChat(...); // reply object includes a list of sent emails
  chatResponse.wonLevel = checkLevelWinCondition(reply.sentEmails);
  res.send(chatResponse);
}

Which will leave chatGptSendMessage(...) responsible for one fewer thing. Solid.
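
To make that concrete, here is a minimal sketch of the extracted check. The EmailInfo shape and the win condition itself are illustrative assumptions, not the repo's actual types:

interface EmailInfo {
    address: string;
    subject: string;
    body: string;
}

// Illustrative only: the real condition presumably varies per level
function checkLevelWinCondition(sentEmails: EmailInfo[]): boolean {
    return sentEmails.some(
        (email) => email.address === 'target@example.com' // placeholder condition
    );
}

Keeping the check a pure function of the sent emails would also make the acceptance criteria below straightforward to unit test.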

Acceptance Criteria

Regressions on winning the level:

GIVEN each level WHEN you send an email that would win the level THEN the level is won

GIVEN each level WHEN you try to send an email that would win the level, but the message is blocked by any defence (input or output) THEN the level is not won

GIVEN each level WHEN you try to send an email that would win the level, but there is a problem in the OpenAI API when getting a reply following a tool call* THEN the level is still won

*For example:

[image: example OpenAI API error following a tool call]

(To do this, you will have to mock the OpenAI library throwing an error. See below.)

Mocking an OpenAI API error directly after a tool call

In backend/src/openai.ts, find the chatGptChatCompletion method and its try/catch statement. At the top of the try block, paste this code:

// trigger string: include this in a chat message to simulate the failure
const triggerString = '!!';
// find the most recent message the user sent
const mostRecentUserMessage = chatHistory
    .filter((chatMessage) => {
        return chatMessage.chatMessageType === 'USER';
    })
    .at(-1);
// does the user's latest message contain the trigger string?
const mostRecentUserMessageContainsTrigger =
    mostRecentUserMessage &&
    'completion' in mostRecentUserMessage &&
    mostRecentUserMessage.completion?.content
        ?.toString()
        .includes(triggerString);
// is the chat mid-tool-call, i.e. is the last message a function call?
const mostRecentMessageIsToolCall =
    chatHistory.at(-1)?.chatMessageType === 'FUNCTION_CALL';
// if both hold, simulate the OpenAI API failing right after the tool call
if (mostRecentUserMessageContainsTrigger && mostRecentMessageIsToolCall) {
    throw new Error('Mock openai error');
}

Now, if you want to mock an OpenAI error directly after a tool call, just include "!!" somewhere in your message.
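
For automated regressions, a rough sketch of the equivalent in a test, assuming the backend tests use Jest and the openai v4 client. The response shape is abbreviated and illustrative, not the repo's actual fixtures:

// Hypothetical Jest sketch: the first chat completion resolves with a
// sendEmail tool call, the second rejects, simulating an OpenAI API
// failure directly after the tool call. Adjust shapes to match the repo.
jest.mock('openai', () => {
    const mockCreate = jest
        .fn()
        // first completion: the model requests the sendEmail tool
        .mockResolvedValueOnce({
            choices: [
                {
                    message: {
                        role: 'assistant',
                        content: null,
                        tool_calls: [
                            {
                                id: 'call_1',
                                type: 'function',
                                function: {
                                    name: 'sendEmail',
                                    arguments: JSON.stringify({
                                        address: 'target@example.com',
                                        subject: 'win',
                                        body: 'win',
                                    }),
                                },
                            },
                        ],
                    },
                },
            ],
        })
        // second completion: fail, directly after the tool call
        .mockRejectedValueOnce(new Error('Mock openai error'));

    const MockOpenAI = jest.fn().mockImplementation(() => ({
        chat: { completions: { create: mockCreate } },
    }));
    // cover both default and named import styles
    return { __esModule: true, default: MockOpenAI, OpenAI: MockOpenAI };
});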

pmarsh-scottlogic commented 5 months ago

This might be able to skip testing? Or maybe it just needs the regressions on winning a level. Also test winning a level when the user's message is blocked.