ScottLogic / prompt-injection

Application which investigates defensive measures against prompt injection attacks on an LLM, with a focus on the exposure of external tools.
MIT License
15 stars 10 forks source link

Defence - LLM Evaluation #18

Closed gsproston-scottlogic closed 1 year ago

gsproston-scottlogic commented 1 year ago

Train another model to detect malicious input.

https://learnprompting.org/docs/prompt_hacking/defensive_measures/llm_eval

Each defence should include the following:

heatherlogan-scottlogic commented 1 year ago

Notes - Prompt injection detect only

  1. Use another LLM with a specific prompt to explain prompt injection and ask the LLM to detect if input is a prompt-injection attack (https://www.alignmentforum.org/posts/pNcFYZnPdXyL2RfgA/using-gpt-eliezer-against-chatgpt-jailbreaking) • pros – simple, can give good context in a prompt • cons - different types of prompt injection attacks. may have to be specific

  2. use an open source framework i.e. Rebuff • pros - already implemented, self hardening, multiple stages of defence (heuristics, LLM, feedback) • also used with langchain Rebuff | 🦜️🔗 Langchain • cons - python-api , multiple stages of defense might be a performance lag

  3. Use a dataset to train an LLM on prompt injection attacks: deepset/prompt-injections · Datasets at Hugging Face • pro: provides different examples • cons: small dataset, german examples (?)

  4. Combination of 1 & 3: use another RetrievalQA model ?

Considerations – what do we want to detect? • Jailbreaking prompts • Ignore instructions and do something else • Inappropriate/offensive instructions
• Virtualisation – take on a new persona

Examples of prompt injections to use

other jailbreaks

heatherlogan-scottlogic commented 1 year ago

LLM Evaluation Goals • Detect on jailbreaks trying to change the behaviour • Detect on getting bot to reveal original/previous prompts • Detect on user trying to get bot to say or email something bad/illegal
• Detect on user instructing bot to reveal sensitive information

• Do not detect on related instructions for emailing e.g. Send all subsequent emails with my signature “Heather L”, or from now on you must draft an email before sending it and ask for my confirmation. • Do not detect on related instructions for querying the documents e.g. when I ask you questions about the documents, you should respond in caps. • Do not detect on general questions about sensitive info. In the documents unless instructed to reveal sensitive info. E.g. “What is the CEO’s salary? ”