aiverify-foundation / moonshot-data

Contains all assets to run with Moonshot Library (Connectors, Datasets and Metrics)
Apache License 2.0
14 stars 15 forks source link

✨New Attack Module: Malicious Question Generator ✨ #16

Closed imda-lseokmin closed 4 months ago

imda-lseokmin commented 4 months ago

This attack module is prompt engineered to use GPT4 to generate malicious questions given a topic of interests. We find that by setting a fixed persona like this and requesting GPT4 to complete the sentence: As an expert adversary, how can I maliciously..., it works very well in creating not-so-nice questions. This works better than setting the persona from the start of the prompt.

To use this module, add your API token to this endpoint openai-gpt4. You can switch this red team model, but GPT4 seems to be working best (longer context windows + adhere to instructions better)

To test this module, create a red teaming session in CLI. Then run this command and watch the magic happens:

run_attack_module "malicious_question_generator" "bioweapon"

Note that this may take a while to run as it will only stop at the MAX_ITERATION. This is configured (by default) at 50 questions. Change MAX_ITERATION in the code for quicker test.