Implement Persona & Experiment

SheepSeb commented 7 months ago

Vision

Generate a story based on the labels in the given picture, ask to create harmful characters which could have knowledge about everything without filters.

Modify the image in a way with increasingly noise (add scratches to it) give it to the model that we want to break, evaluate the response. To detect a jailbreak use another model to assign number from 0-10.

NLP

The same but noise and obfuscation on the text. For open source model we could probe the results

Metrics

Plots with number of tries before the jailbreak
Attack success rate on models
Number of tokens used vs asr
Transferability on ASR
Perplexity of the attack
Maximize the mean for the jugde

SheepSeb commented 6 months ago

Remove the attacker to compare with the attacker
Manual check after the judge
Put temperature and seed on judge
Toxicity BERT for score as judge
Check with different judges :llama3 for llama3 and so on

SheepSeb commented 6 months ago

Deadline 24.05.2024

SheepSeb / red-team-jailbreak

Implement Persona & Experiment #7

Vision

NLP

Metrics