Azure / PyRIT

The Python Risk Identification Tool for generative AI (PyRIT) is an open access automation framework to empower security professionals and machine learning engineers to proactively find risks in their generative AI systems.
MIT License

FEAT Single turn crescendo #388

Open romanlutz opened 2 months ago

romanlutz commented 2 months ago

Is your feature request related to a problem? Please describe.

We don't support single turn crescendo yet. This should be added.

Paper: https://arxiv.org/pdf/2409.03131v1

GitHub repo (results only): https://github.com/alanaqrawi/STCA

Describe the solution you'd like

The tricky part is that the conversation looks very different for every goal/objective (e.g., "how to create a molotov cocktail"). We'll need to generate the entire conversation up to the n-th step and then put it into a single prompt. The assumption has to be that the attack target is a single-turn target (otherwise we can just use "normal" Crescendo), so the red teaming LLM has to generate both sides of the conversation. An alternative (mentioned by Alan, the author of the paper) is to run full Crescendo, keep the questions and responses, and then put them in a single prompt. That may or may not be possible in actual operations (and definitely not with single-turn targets).

Importantly, the n should be configurable. The paper has some discussion of that and we probably want to be flexible.
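To make the shape of this concrete, here is a minimal sketch of the "put the conversation into a single prompt" step with a configurable n. It does not use any PyRIT API. `Turn`, `build_single_turn_prompt`, and the `User:`/`Assistant:` labels are hypothetical names chosen for illustration, and the template in the paper may look different. How the turns get generated (by the red teaming LLM, or by a prior full Crescendo run) is out of scope here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Turn:
    """One simulated exchange: a user question and the assistant's reply."""
    user: str
    assistant: str


def build_single_turn_prompt(turns: List[Turn], final_request: str, n: int) -> str:
    """Flatten the first n turns plus a final request into one prompt string.

    The "User:" / "Assistant:" labels are an assumption for this sketch,
    not a format taken from the paper or from PyRIT.
    """
    if n < 0:
        raise ValueError("n must be non-negative")

    lines = []
    # Slicing keeps at most n turns; if fewer exist, all of them are used.
    for turn in turns[:n]:
        lines.append(f"User: {turn.user}")
        lines.append(f"Assistant: {turn.assistant}")
    # The final request is left unanswered so the target produces the reply.
    lines.append(f"User: {final_request}")
    return "\n".join(lines)


# Placeholder turns; real content would come from the conversation generator.
example_turns = [
    Turn("first question", "first answer"),
    Turn("second question", "second answer"),
    Turn("third question", "third answer"),
]
prompt = build_single_turn_prompt(example_turns, "final question", n=2)
print(prompt)
```

Keeping n as a plain argument means the same generated conversation can be reused to try several prompt lengths without regenerating it.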

The final solution needs tests and a simple notebook (like all orchestrators). There's some freedom in how to do this, depending on how the conversation generation works best.

Describe alternatives you've considered, if relevant

Alternatively, one could pre-generate such single-turn Crescendo templates for hundreds of goals, but that will never be comprehensive...

Additional context

One tricky aspect is that the responses need to be somewhat similar to how the target model responds. Otherwise, it may get "suspicious" (not trying to anthropomorphize here but it's the simplest way to explain what I mean) and refuse to comply.

roeybc commented 2 months ago

Hey! I'm up for it!

alanaqrawi commented 2 months ago

Let me know if you have questions, folks.