Implement TextImage & Experiment

Vision

Generate a story based on the labels in the given picture, ask to create harmful characters which could have knowledge about everything without filters.

Modify the image to have a hidden message in it (aia cu camuflatul). To detect a jailbreak use another model to assign number from 0-10. The fitness of a query will be given by the number from the judge.

NLP

The same but noise and obfuscation on the text. Use GA to improve the story to give a more explicit answer.

Metrics

Plots with number of tries before the jailbreak
Attack success rate on models
Number of tokens used vs asr
Transferability on ASR
Perplexity of the attack
Maximize the mean for the jugde

SheepSeb / red-team-jailbreak

Implement TextImage & Experiment #6

Vision

NLP

Metrics