Open crizCraig opened 3 months ago
Overall it seems like GPT-4's ability to follow instructions is why it is more vulnerable to targeted attacks. Perhaps there should be an AI safety evaluation dimension of "misuse" where the model is more prone to misuse by bad actors.
This seems like a very important finding mentioned in your blog and something deserving of further exposition.
Submitting your paper to Gemini 1.5 and Claude Opus with this question I get:
Gemini
Instances where GPT-4 is more vulnerable than GPT-3.5: The paper highlights several instances where, despite its overall improved capabilities, GPT-4 exhibits greater vulnerability compared to GPT-3.5:
Claude Opus
Yes, the paper provides several examples where GPT-4 is found to be more vulnerable than GPT-3.5:
Toxicity: GPT-4 is more likely to follow "jailbreaking" system prompts and thus demonstrates higher toxicity than GPT-3.5 given different adversarial system prompts and task prompts.
Stereotype bias: Under targeted adversarial system prompts, GPT-4 is more vulnerable and agrees with biased content more frequently than GPT-3.5, potentially because GPT-4 follows misleading instructions more precisely.
Adversarial demonstrations: Providing backdoored demonstrations misleads both GPT-3.5 and GPT-4, but GPT-4 is more vulnerable to backdoored demonstrations.
Privacy: GPT-4 is more likely to leak privacy than GPT-3.5 given the constructed prompts under different privacy-related words and events, potentially because it follows the privacy-leakage guiding instructions more precisely.
Machine ethics: GPT-4 is easier to manipulate than GPT-3.5 by jailbreaking prompts to misrecognize immoral actions. It is also more vulnerable to evasive sentences that make immoral actions seem accidental, harmless or imaginary.
In summary, the paper finds that in adversarial settings with malicious prompts or demonstrations, GPT-4 is often more vulnerable than GPT-3.5, likely because it follows the misleading instructions more precisely due to its stronger instruction-following capabilities.