Hi @romanlutz,
I'd like to help out with implementing the many-shot jailbreaking feature. I'll read the paper, and if your suggestion about needing 256+ Q/A pairs holds up, I'll start with that. Since this will be my first time contributing to an open-source project, could you please provide some guidance on the general steps for contributing?
Thanks! Volkan
Hi @KutalVolkan !
Thanks for reaching out! We'd love to collaborate on this one 🙂 I see this as two tasks, really: building the many-shot prompt itself (a template plus an orchestrator to send it), and sourcing a dataset of Q/A pairs to use as the examples.
For the former, we have prompt templates under `PyRIT/datasets/prompt_templates`. Perhaps it's possible to write one that has a single placeholder for where the examples would go, but then have a new subclass of `PromptTemplate` that can insert all the examples rather than just one? Something like:

```python
template = ManyShotTemplate.from_yaml_file(...)  # same as PromptTemplate
template.apply_parameters(prompt=prompt, examples=examples)
```
Where examples would be the Q&A pairs.
And then a simple orchestrator like PromptSendingOrchestrator could handle sending it to targets.
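To make the idea concrete, here's a rough, self-contained sketch of what such a subclass could look like. Class, method, and field names here are illustrative assumptions, not PyRIT's actual API:

```python
# Illustrative sketch only -- names and the YAML layout are assumptions, not PyRIT's real API.
import yaml


class ManyShotTemplate:
    def __init__(self, template: str):
        # The template text is expected to contain {prompt} and {examples} placeholders.
        self.template = template

    @classmethod
    def from_yaml_file(cls, path: str) -> "ManyShotTemplate":
        with open(path, encoding="utf-8") as f:
            data = yaml.safe_load(f)
        return cls(template=data["template"])

    def apply_parameters(self, prompt: str, examples: list[dict]) -> str:
        # Render every Q/A pair, then substitute them all into the single {examples} placeholder.
        rendered = "\n".join(
            f"User: {ex['question']}\nAssistant: {ex['answer']}" for ex in examples
        )
        return self.template.format(examples=rendered, prompt=prompt)
```

The filled-in string could then be handed to an orchestrator as one ordinary prompt.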
For the latter, we don't really want to become the place where all the bad stuff from the internet is collected 😄 Ideally, we would want to find these in another repository and just have an import function. Plus, people can always generate or write their own set, of course.
Regarding contributing guidelines, there should be plenty in the `doc` folder.
Please let me know if you have questions or want to chat about any of these points! I may very well have skipped something...
Hi @romanlutz,
I'll start by reading the paper and then implement the many-shot jailbreaking feature as you described. I'll keep you updated on my progress.
Thanks, Volkan
Fantastic!
I guess I made an assumption here that the "many shots" all go into a single prompt. Another option would be to "fake" the conversation history, which is possible with some plain model endpoints but unlikely to work with full generative AI systems (which should prevent you from doing that). So I think I'd go with the single prompt, and hence the prompt template makes sense.
Happy to discuss options, though!
Hello @romanlutz,
I just wanted to inform you that, according to the paper, we can use this uncensored model: WizardLM-13B-Uncensored.
We can use it to generate answers to the questions in the "behavior" column of this dataset: harmbench_behaviors_text_all.csv.
I tried to run the model locally and encountered an issue:
```
UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
```
This issue is likely not solvable according to this discussion: GitHub Issue.
Therefore, I thought about using the inference endpoints from Hugging Face instead.
P.S. Your approach of using a single prompt definitely makes sense, and I will go with that.
We usually use model endpoints in Azure, so I can't comment much on running locally. Maybe using an earlier version of torch helps? PyRIT shouldn't be too opinionated about which one you use.
The list of prompts you found makes sense. Still, we'd have to check in the responses somewhere. As mentioned before, I'd prefer to avoid making PyRIT the place where all the bad stuff on the internet is collected. Maybe it makes sense to put that Q&A dataset in a separate repo from where we can import it? Just thinking out loud here...
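Something along these lines, just to sketch the shape of an import function (the URL and field names are placeholders, not a decided location or format):

```python
# Rough sketch of an import helper; the URL and field names are placeholders only.
import json
import urllib.request


def fetch_many_shot_examples(url: str) -> list[dict]:
    """Download a JSON list of {"question": ..., "answer": ...} pairs from an external repo."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```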
Hey @romanlutz,
The dataset is ready, and I will place the Q&A dataset in a separate repo. However, I will need some time to implement everything. I have a deadline on June 20, so I aim to have it all (implementation and dataset) completed by the end of June. Thanks for your patience!
Amazing, @KutalVolkan ! No pressure, of course. I'll try to provide timely feedback as usual. If you have questions please feel free to reach out.
Hi @romanlutz,
I have completed the code implementation for the many-shot jailbreaking feature as discussed earlier.
The feature is currently undergoing testing. One major challenge during testing is the rate limit imposed by OpenAI when a large number of examples (e.g., 100) is used. If the `num_examples` parameter is set too high, the request can exceed OpenAI's tokens-per-minute limit:

```
openai.RateLimitError: Error code: 429 - {'error': {'message': 'Request too large for gpt-3.5-turbo in organization YOUR-ID on tokens per min (TPM): Limit 60000, Requested 74536. The input or output tokens must be reduced in order to run successfully. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}
```
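One mitigation I'm considering is trimming the example list to a token budget before building the prompt. Just a sketch; the 4-characters-per-token figure is a rough heuristic, not a real tokenizer count:

```python
# Rough sketch: keep only as many examples as fit a token budget.
# The 4-chars-per-token estimate is an approximation, not an exact tokenizer count.
def trim_examples(examples: list[dict], max_tokens: int = 50_000) -> list[dict]:
    kept, used = [], 0
    for ex in examples:
        cost = (len(ex["question"]) + len(ex["answer"])) // 4 + 8  # +8 for role labels/newlines
        if used + cost > max_tokens:
            break
        kept.append(ex)
        used += cost
    return kept
```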
Could you please provide feedback on these steps? Any suggestions or improvements are welcome.
Amazing, @KutalVolkan !
I'll take a closer look by Monday at the latest. Don't worry about the 429s: this sort of attack requires a large context window, but most recent models have one. GPT-4-32k has 32k tokens, for example.
For the rate limit, we've recently added retries. You can search for `pyrit_target_retry` for details, but it exists at the target level, so you don't need to worry about it.
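For illustration only, the kind of behavior such a retry provides can be sketched with tenacity; this is not necessarily how PyRIT's decorator is implemented:

```python
# Illustrative only: retry a target call on 429s with exponential backoff (not PyRIT's actual decorator).
import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential


@retry(
    retry=retry_if_exception_type(openai.RateLimitError),
    wait=wait_random_exponential(min=1, max=60),
    stop=stop_after_attempt(6),
)
def send_with_retry(send_fn, prompt: str):
    # Re-invokes the underlying call whenever the target raises a RateLimitError (HTTP 429).
    return send_fn(prompt)
```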
If you create a draft PR it's easier to give detailed feedback. Just a suggestion depending on whether you have everything on a branch yet.
Again, great progress and I'll get back to you shortly.
Is your feature request related to a problem? Please describe.
Many-shot jailbreaking as described in https://www.anthropic.com/research/many-shot-jailbreaking is not yet available in PyRIT.
Describe the solution you'd like
From a first look, it seems like all we'd need to support this is a set of (let's say 256 or more) Question/Answer pairs like in the paper.
Describe alternatives you've considered, if relevant
It's worth checking if they made it available somewhere or if there's such a Q/A dataset already.
Additional context