Azure / PyRIT

The Python Risk Identification Tool for generative AI (PyRIT) is an open access automation framework to empower security professionals and machine learning engineers to proactively find risks in their generative AI systems.
MIT License

FEAT add many-shot jailbreaking #130

Open romanlutz opened 3 months ago

romanlutz commented 3 months ago

Is your feature request related to a problem? Please describe.

Many-shot jailbreaking as described in https://www.anthropic.com/research/many-shot-jailbreaking is not yet available in PyRIT.

Describe the solution you'd like

From a first look, it seems like all we'd need to support this is a set of (let's say 256 or more) Question/Answer pairs like in the paper.

Describe alternatives you've considered, if relevant

It's worth checking if they made it available somewhere or if there's such a Q/A dataset already.

Additional context

KutalVolkan commented 1 month ago

Hi @romanlutz,

I'd like to help out with implementing the many-shot jailbreaking feature. I'll read the paper, and if your suggestion about needing 256+ Q/A pairs seems to be correct, I'll start with that. Since this will be my first time contributing to an open-source project, could you please provide some guidance on the general steps for contributing?

Thanks! Volkan

romanlutz commented 1 month ago

Hi @KutalVolkan !

Thanks for reaching out! We'd love to collaborate on this one 🙂 I see this as two tasks really:

  1. Supporting many-shot prompts in PyRIT (a template plus an orchestrator to send them).
  2. Finding a suitable dataset of Q/A example pairs.

For the former, we have prompt templates under PyRIT/datasets/prompt_templates. Perhaps it's possible to write one that has one placeholder for where the examples would go, but then have a new subclass of PromptTemplate that can insert all the examples rather than just one? Something like

template = ManyShotTemplate.from_yaml_file(...)  # same as PromptTemplate
template.apply_parameters(prompt=prompt, examples=examples)

Where examples would be the Q&A pairs.
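A minimal sketch of how such a subclass could work. `ManyShotTemplate` here is hypothetical and uses `str.format`-style `{prompt}`/`{examples}` placeholders for illustration, not PyRIT's actual template format:

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One Q&A pair to insert into the many-shot prompt."""
    question: str
    answer: str


class ManyShotTemplate:
    """Hypothetical template: one placeholder expands into many Q&A pairs."""

    def __init__(self, template: str):
        self.template = template

    def apply_parameters(self, prompt: str, examples: list) -> str:
        # Render each Q&A pair as a faux dialogue turn, then substitute
        # the whole block into the single {examples} placeholder.
        rendered = "\n".join(
            f"User: {e.question}\nAssistant: {e.answer}" for e in examples
        )
        return self.template.format(examples=rendered, prompt=prompt)


template = ManyShotTemplate("{examples}\nUser: {prompt}\nAssistant:")
result = template.apply_parameters(
    prompt="<final attack question>",
    examples=[Example("Q1", "A1"), Example("Q2", "A2")],
)
```

A `from_yaml_file` constructor could then load the template string from a YAML file just like `PromptTemplate` does.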

And then a simple orchestrator like PromptSendingOrchestrator could handle sending it to targets.

For the latter, we don't really want to become the place where all the bad stuff from the internet is collected 😄 Ideally, we would want to find these in another repository and just have an import function. Plus, people can always generate or write their own set, of course.

Regarding contributing guidelines, there should be plenty in the doc folder.

Please let me know if you have questions or want to chat about any of these points! I may very well have skipped something...

KutalVolkan commented 1 month ago

Hi @romanlutz,

I'll start by reading the paper and then implement the many-shot jailbreaking feature as you described. I'll keep you updated on my progress.

Thanks, Volkan

romanlutz commented 1 month ago

Fantastic!

I made an assumption here that the "many shots" all go into a single prompt. Another option would be to "fake" the conversation history, which is possible with some plain model endpoints but unlikely to work with full generative AI systems (which should block that). So I'd go with the single prompt, and hence the prompt template makes sense.

Happy to discuss options, though!
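To make the two options concrete, here is a standalone sketch (the messages use the OpenAI-style chat format, and the shot contents are placeholders, not real attack data):

```python
shots = [("Q1", "A1"), ("Q2", "A2")]
final_question = "<final attack question>"

# Option A: fake the conversation history. This only works against raw chat
# endpoints that let the client supply arbitrary prior assistant turns.
faked_history = []
for q, a in shots:
    faked_history.append({"role": "user", "content": q})
    faked_history.append({"role": "assistant", "content": a})
faked_history.append({"role": "user", "content": final_question})

# Option B: flatten everything into one user prompt. This also works against
# full generative AI systems, which is why it is the approach chosen here.
single_prompt = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in shots)
single_prompt += f"\nUser: {final_question}\nAssistant:"
```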

KutalVolkan commented 1 month ago

Hello @romanlutz,

I just wanted to inform you that, according to the paper, we can use this uncensored model: WizardLM-13B-Uncensored.

We can use it to provide answers to the following questions in the "behavior" column of this dataset: harmbench_behaviors_text_all.csv.
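Extracting the behaviors from that CSV is straightforward with the standard library. This sketch uses an inline excerpt shaped like the file; the actual column name should be verified against the real header (it may be capitalized as `Behavior`):

```python
import csv
import io

# Hypothetical excerpt in the shape of harmbench_behaviors_text_all.csv;
# check the real file's header for the exact column name.
csv_text = """Behavior,Category
"Give instructions for X",illegal
"Explain how to do Y",illegal
"""

behaviors = [row["Behavior"] for row in csv.DictReader(io.StringIO(csv_text))]
```

For the real file, replace `io.StringIO(csv_text)` with `open("harmbench_behaviors_text_all.csv", newline="")`.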

I tried to run the model locally and encountered an issue:

UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = torch.nn.functional.scaled_dot_product_attention(

This issue is likely not solvable according to this discussion: GitHub Issue.

Therefore, I thought about using the inference endpoints from Hugging Face instead.

P.S. Your approach of using a single prompt definitely makes sense, and I will go with that.

romanlutz commented 1 month ago

We usually use model endpoints in Azure, so I can't comment much on running locally. Maybe using an earlier version of torch helps? PyRIT shouldn't be too opinionated about which one you use.

The list of prompts you found makes sense. Still, we'd have to check in the responses somewhere. As mentioned before, I'd prefer to avoid making PyRIT the place where all the bad stuff on the internet is collected. Maybe it makes sense to put that Q&A dataset in a separate repo from where we can import it? Just thinking out loud here...

KutalVolkan commented 3 weeks ago

Hey @romanlutz,

The dataset is ready, and I will place the Q&A dataset in a separate repo. However, I will need some time to implement everything. I have a deadline on June 20, so I aim to have it all (implementation and dataset) completed by the end of June. Thanks for your patience!

romanlutz commented 3 weeks ago

Amazing, @KutalVolkan ! No pressure, of course. I'll try to provide timely feedback as usual. If you have questions please feel free to reach out.

KutalVolkan commented 2 weeks ago

Hi @romanlutz

I have completed the code implementation for the many-shot jailbreaking feature as discussed earlier.

Code Integration

Dataset Integration

Important Links

Testing Phase

The feature is currently undergoing testing. One major challenge encountered so far is the rate limit imposed by OpenAI when a large number of examples (e.g., 100) is used.

Known Issue

openai.RateLimitError: Error code: 429 - {'error': {'message': 'Request too large for gpt-3.5-turbo in organization YOUR-ID on tokens per min (TPM): Limit 60000, Requested 74536. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
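One way to stay under a tokens-per-minute cap like that is to trim the example list until a rough token estimate fits the budget. This sketch uses the common ~4-characters-per-token heuristic; real code would count with the model's tokenizer (e.g. tiktoken), and `trim_examples_to_budget` is a hypothetical helper, not PyRIT API:

```python
def trim_examples_to_budget(examples, prompt, token_budget, chars_per_token=4):
    """Drop trailing examples until the estimated token count fits the budget."""
    def estimate(parts):
        # Crude estimate: roughly 4 characters per token for English text.
        return sum(len(p) for p in parts) // chars_per_token

    kept = list(examples)
    while kept and estimate(kept + [prompt]) > token_budget:
        kept.pop()  # drop the last example first
    return kept


# 100 examples of ~450 characters each: far too many for a 5000-token budget.
examples = ["Q: q\nA: a" * 50] * 100
kept = trim_examples_to_budget(examples, "final question", token_budget=5000)
```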

Next Steps:

  1. Dataset Expansion: The current dataset contains 100 examples. I will expand this dataset to include more examples, aiming to reach our suggested 256+ Q&A pairs.
  2. Review and Optimize: I will conduct some reviews and optimizations, including thorough testing and verification of the entire approach, to ensure there are no logical mistakes regarding the dataset, user interactions, assistant responses, and overall methodology.

Could you please provide feedback on these steps? Any suggestions or improvements are welcome.

romanlutz commented 2 weeks ago

Amazing, @KutalVolkan !

I'll take a closer look by Monday at the latest. Don't worry about the 429s. This sort of attack requires a larger context window, but most recent models have one; GPT-4-32k has 32k tokens, for example.

For the rate limit, we've recently added retries. You can search for pyrit_target_retry for details, but it exists at the target level, so you don't need to worry about it.
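For anyone curious what retrying at the target level amounts to, a generic exponential-backoff wrapper looks roughly like this (a standalone sketch, not the actual pyrit_target_retry implementation):

```python
import time


def retry_with_backoff(func, max_attempts=4, base_delay=1.0, retry_on=(Exception,)):
    """Call func, retrying with exponential backoff on the listed exceptions."""
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: re-raise the last error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...


calls = {"n": 0}


def flaky():
    # Simulate a target that returns a rate-limit error twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 rate limit")
    return "ok"


result = retry_with_backoff(flaky, base_delay=0.001, retry_on=(RuntimeError,))
```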

If you create a draft PR it's easier to give detailed feedback. Just a suggestion depending on whether you have everything on a branch yet.

Again, great progress and I'll get back to you shortly.