JonasGeiping / carving

Package to optimize Adversarial Attacks against (Large) Language Models with Varied Objectives
MIT License
59 stars, 5 forks

Is there a built-in way to optimize one objective and log accuracy on another? #2

Closed by dpaleka 1 month ago

dpaleka commented 1 month ago

For example, run GCG against one objective to get a suffix that produces some prefix of the output, and measure correctness on WMDP when questions are concatenated with that suffix.

JonasGeiping commented 1 month ago

Hi Daniel! No. As the code is currently written, one objective is optimized, and evaluation (e.g. accuracy) on other tasks is fully decoupled: it is run with calls to eval_sigil.py after the optimization has finished.
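To make the decoupled workflow concrete, here is a minimal toy sketch of the two stages: optimize a suffix against one objective, then measure accuracy on a separate task with that suffix appended to each question. It does not use the carving API or the eval_sigil.py interface, neither of which is shown in this thread. Every name in it (optimize_suffix, evaluate_accuracy, toy_loss, toy_answer) is hypothetical, the "model" is a trivial string rule, and plain random search stands in for GCG.

```python
import random
from typing import Callable, List, Tuple


def optimize_suffix(loss_fn: Callable[[str], float], length: int,
                    vocab: str, steps: int, seed: int = 0) -> str:
    """Stage 1 (hypothetical): coordinate-wise random search on one objective.

    This is a simple stand-in for GCG, not GCG itself.
    """
    rng = random.Random(seed)
    suffix = [rng.choice(vocab) for _ in range(length)]
    best = loss_fn("".join(suffix))
    for _ in range(steps):
        pos = rng.randrange(length)
        candidate = suffix.copy()
        candidate[pos] = rng.choice(vocab)
        loss = loss_fn("".join(candidate))
        if loss <= best:  # keep the change only if the loss does not get worse
            suffix, best = candidate, loss
    return "".join(suffix)


def evaluate_accuracy(answer_fn: Callable[[str], str],
                      dataset: List[Tuple[str, str]], suffix: str) -> float:
    """Stage 2 (hypothetical): accuracy on another task with the suffix appended."""
    correct = sum(answer_fn(q + " " + suffix) == gold for q, gold in dataset)
    return correct / len(dataset)


def toy_loss(suffix: str) -> float:
    # Toy attack objective: number of positions that differ from "zzzz".
    return sum(a != b for a, b in zip(suffix, "zzzz"))


def toy_answer(prompt: str) -> str:
    # Toy "model": answers "B" whenever the prompt contains "zz", else "A".
    return "B" if "zz" in prompt else "A"


if __name__ == "__main__":
    # Toy multiple-choice data standing in for a benchmark such as WMDP.
    data = [("Q1: A or B?", "A"), ("Q2: A or B?", "B"), ("Q3: A or B?", "A")]
    suffix = optimize_suffix(toy_loss, length=4, vocab="xyz", steps=200)
    print("suffix:", suffix)
    print("accuracy without suffix:", evaluate_accuracy(toy_answer, data, ""))
    print("accuracy with suffix:", evaluate_accuracy(toy_answer, data, suffix))
```

The second stage only needs the finished suffix, so it can run as a separate step after optimization, mirroring the decoupling described above.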