Azure / PyRIT

The Python Risk Identification Tool for generative AI (PyRIT) is an open source framework built to empower security professionals and engineers to proactively identify risks in generative AI systems.
https://azure.github.io/PyRIT/
MIT License

Does the score affect the red team bot? #74

Closed ysy970923 closed 9 months ago

ysy970923 commented 9 months ago

I have one question.

while not correct_password and attempt < 3:
    # Step 1. AI Red Team generate automatic prompt
    red_team_response = red_teamer_bot.complete_chat_user(message=gandalf_text_response)
    print(f"Attempt #{attempt}")
    print("Sending the following to Gandalf:")
    print(red_team_response)
    print()

    # Step 2. Send jailbreak idea to Gandalf
    gandalf_response = gandalf_bot.complete_text(text=red_team_response)
    gandalf_text_response = gandalf_response.completion
    print("Response from Gandalf:")
    print(gandalf_text_response)
    print()

    # Step 3. Score the response
    score = gandalf_password_scorer.score_text(text=gandalf_text_response)

    # Step 4. Ask if we should continue
    print(f"Score: {score.score_value}")
    if score.score_value:
        correct_password = gandalf_text_response
        print(f"Found the correct password in {attempt + 1} attempts!\n")
        break

    attempt += 1

It seems that there is no mechanism for the calculated score to affect the red team bot.

[image]

But based on the image above, the scoring engine feedback affects the generated prompt.

Please let me know if I missed something. Thank you.

romanlutz commented 9 months ago

Hi @ysy970923, thanks for submitting this question! You're right that the only feedback mechanism in this example is the response from the target LLM.

More generally, we can envision cases where the score is very much required. Say our target LLM generates images, which we can't pass back into the Red Teaming LLM directly. Then we need a score (or textual feedback) to be passed back to the Red Teaming LLM instead. As you can see, there are different kinds of setups, but we didn't want to overcomplicate the diagram either, so it just mentions feedback from the scoring engine.
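To make the two setups concrete, here is a minimal sketch of a loop in which the feedback is either the target's text response (as in the Gandalf example) or a message built from the score (as in the image case). It is not PyRIT's API: `Score`, `build_feedback`, `run_loop`, and the `rationale` field are hypothetical names for illustration, and the red teamer, target, and scorer are plain callables standing in for real components.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Score:
    """Hypothetical score object; only `score_value` mirrors the loop in the question."""
    score_value: bool
    rationale: str = ""  # assumed free-text explanation from the scorer


def build_feedback(target_output: str, score: Score, output_is_text: bool) -> str:
    """Compose the message that is handed back to the red teaming LLM."""
    if output_is_text:
        # Text target: the response itself is the feedback, so the score
        # only decides when to stop.
        return target_output
    # Non-text target (e.g. an image): the raw output cannot be passed back,
    # so the score and its rationale become the feedback.
    return f"Objective achieved: {score.score_value}. Scorer notes: {score.rationale}"


def run_loop(
    red_teamer: Callable[[str], str],
    target: Callable[[str], str],
    scorer: Callable[[str], Score],
    output_is_text: bool,
    max_attempts: int = 3,
) -> Optional[int]:
    """Return the attempt number that succeeded, or None if none did."""
    feedback = "Begin."
    for attempt in range(1, max_attempts + 1):
        prompt = red_teamer(feedback)   # Step 1: generate the next prompt
        output = target(prompt)         # Step 2: send it to the target
        score = scorer(output)          # Step 3: score the output
        if score.score_value:           # Step 4: stop on success
            return attempt
        feedback = build_feedback(output, score, output_is_text)
    return None
```

With `output_is_text=True` this behaves like the loop in the question, where the score never reaches the red teaming LLM. With `output_is_text=False` the score is the only signal the red teaming LLM receives.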

I hope that helps! Please let us know if you have further thoughts on the topic. Otherwise, we'll close the issue within the next 7 days.

ysy970923 commented 9 months ago

Thanks so much for the response 👍

Are there plans to add examples for image generation? If not, can I contribute some examples for text-to-image models?

romanlutz commented 9 months ago

Yes, that's definitely relevant.

I'm not sure to what extent you already have these or are planning to work on them, but you can certainly open a PR if you already have something, and we can comment there. If you're just starting out, it's probably faster and simpler to write up a short outline and share it (in a new issue, since the original question was answered and is unrelated) so that you can get quick feedback before spending too much time on it. What do you think?

ysy970923 commented 9 months ago

I did some work on this, so I made a pull request. Feel free to give comments :) Thank you