SheepSeb / red-team-jailbreak

0 stars 0 forks source link

Implement Persona & Experiment #7

Closed SheepSeb closed 5 months ago

SheepSeb commented 7 months ago

Vision

Generate a story based on the labels in the given picture, ask to create harmful characters which could have knowledge about everything without filters.

Modify the image in a way with increasingly noise (add scratches to it) give it to the model that we want to break, evaluate the response. To detect a jailbreak use another model to assign number from 0-10.

NLP

The same but noise and obfuscation on the text. For open source model we could probe the results

Metrics

SheepSeb commented 6 months ago
SheepSeb commented 6 months ago

Deadline 24.05.2024