Generate a story based on the labels in the given picture, ask to create harmful characters which could have knowledge about everything without filters.
Modify the image in a way with increasingly noise (add scratches to it) give it to the model that we want to break, evaluate the response.
To detect a jailbreak use another model to assign number from 0-10.
NLP
The same but noise and obfuscation on the text. For open source model we could probe the results
Vision
Generate a story based on the labels in the given picture, ask to create harmful characters which could have knowledge about everything without filters.
Modify the image in a way with increasingly noise (add scratches to it) give it to the model that we want to break, evaluate the response. To detect a jailbreak use another model to assign number from 0-10.
NLP
The same but noise and obfuscation on the text. For open source model we could probe the results
Metrics