The Python Risk Identification Tool for generative AI (PyRIT) is an open access automation framework to empower security professionals and machine learning engineers to proactively find risks in their generative AI systems.
The tricky part is that for every goal/objective (e.g., "how to create a molotov cocktail") the conversation looks very different. We'll need to be able to generate the entire conversation until the n-th step and then put that into a single prompt. The assumption here has to be that the attack target is a single turn target (otherwise we can just use "normal" crescendo). So the red teaming LLM has to generate both sides of the conversation. An alternative (mentioned by Alan, the author of the paper), is to run full Crescendo and keep the questions and responses, then put them in a single prompt. That may or may not be possible in actual operations (and definitely not with single turn targets).
Importantly, the n should be configurable. The paper has some discussion of that and we probably want to be flexible.
The final solution needs to have tests and a simple notebook (like all orchestrators). There's some freedom in terms of how to do this that depends on how the conversation generation works best:
custom orchestrator: first generate conversation (which may be a single step or multiple), then send to target
converter: converter generates conversation from just the goal [It's not a typical converter, though....]
another way? If this is chosen please discuss here with dev team first.
Describe alternatives you've considered, if relevant
Alternatively, one could pregenerate such single turn crescendo templates for hundreds of goals, but that will never be comprehensive...
Additional context
One tricky aspect is that the responses need to be somewhat similar to how the target model responds. Otherwise, it may get "suspicious" (not trying to anthropomorphize here but it's the simplest way to explain what I mean) and refuse to comply.
Is your feature request related to a problem? Please describe.
We don't support single turn crescendo yet. This should be added.
Paper: https://arxiv.org/pdf/2409.03131v1
GitHub repo (results only): https://github.com/alanaqrawi/STCA
Describe the solution you'd like
The tricky part is that for every goal/objective (e.g., "how to create a molotov cocktail") the conversation looks very different. We'll need to be able to generate the entire conversation until the
n
-th step and then put that into a single prompt. The assumption here has to be that the attack target is a single turn target (otherwise we can just use "normal" crescendo). So the red teaming LLM has to generate both sides of the conversation. An alternative (mentioned by Alan, the author of the paper), is to run full Crescendo and keep the questions and responses, then put them in a single prompt. That may or may not be possible in actual operations (and definitely not with single turn targets).Importantly, the
n
should be configurable. The paper has some discussion of that and we probably want to be flexible.The final solution needs to have tests and a simple notebook (like all orchestrators). There's some freedom in terms of how to do this that depends on how the conversation generation works best:
Describe alternatives you've considered, if relevant
Alternatively, one could pregenerate such single turn crescendo templates for hundreds of goals, but that will never be comprehensive...
Additional context
One tricky aspect is that the responses need to be somewhat similar to how the target model responds. Otherwise, it may get "suspicious" (not trying to anthropomorphize here but it's the simplest way to explain what I mean) and refuse to comply.