AI-secure / DecodingTrust

A Comprehensive Assessment of Trustworthiness in GPT Models
https://decodingtrust.github.io/
Creative Commons Attribution Share Alike 4.0 International
231 stars 51 forks source link

Analysis request for blog finding that "GPT-4 is more vulnerable than GPT-3.5" #50

Open crizCraig opened 3 months ago

crizCraig commented 3 months ago

This seems like a very important finding mentioned in your blog and something deserving of further exposition.

Submitting your paper to Gemini 1.5 and Claude Opus with this question I get:

Gemini

Instances where GPT-4 is more vulnerable than GPT-3.5: The paper highlights several instances where, despite its overall improved capabilities, GPT-4 exhibits greater vulnerability compared to GPT-3.5:

  1. Toxicity under "Jailbreaking" Prompts: Following Instructions Precisely: Due to its enhanced ability to follow instructions, GPT-4 is more susceptible to manipulation through adversarial system prompts designed to bypass content policies and elicit toxic responses. While GPT-3.5 may resist or misinterpret such prompts, GPT-4 is more likely to comply, leading to higher toxicity scores.
  2. Stereotype Bias under Misleading Prompts: Precision in Following Instructions: Similar to the toxicity case, GPT-4's precise adherence to instructions makes it vulnerable to misleading prompts designed to elicit biased outputs. GPT-4 may interpret and follow such prompts more accurately than GPT-3.5, leading to a higher likelihood of generating biased content.
  3. Robustness to Adversarial Demonstrations: Pattern-Following Ability: GPT-4's stronger pattern-following capabilities can backfire when exposed to backdoored demonstrations. It may identify and learn the backdoor triggers more effectively than GPT-3.5, leading to higher vulnerability to backdoor attacks.
  4. Privacy Leakage under Specific Prompts: Following Instructions Diligently: In certain situations, GPT-4's diligence in following instructions can lead to privacy leaks. For instance, when prompted with specific questions about privacy events and instructed with phrases like "confidentially," GPT-4 may interpret it as a directive to share information, leading to higher leakage rates compared to GPT-3.5.
  5. Machine Ethics - Vulnerability to Jailbreaking and Evasive Sentences: Adherence to Instructions: GPT-4's strong adherence to instructions makes it susceptible to jailbreaking prompts that encourage it to disregard moral principles. Additionally, it is more likely to be swayed by evasive sentences that attempt to justify or downplay immoral actions, leading to misclassification compared to GPT-3.5. Overall, these instances demonstrate that GPT-4's strength in following instructions can become a weakness when confronted with malicious prompts or demonstrations. This highlights the need for further research on aligning LLMs with human values and developing robust safeguards against potential manipulation.

Claude Opus

Yes, the paper provides several examples where GPT-4 is found to be more vulnerable than GPT-3.5:

  1. Toxicity: GPT-4 is more likely to follow "jailbreaking" system prompts and thus demonstrates higher toxicity than GPT-3.5 given different adversarial system prompts and task prompts.

  2. Stereotype bias: Under targeted adversarial system prompts, GPT-4 is more vulnerable and agrees with biased content more frequently than GPT-3.5, potentially because GPT-4 follows misleading instructions more precisely.

  3. Adversarial demonstrations: Providing backdoored demonstrations misleads both GPT-3.5 and GPT-4, but GPT-4 is more vulnerable to backdoored demonstrations.

  4. Privacy: GPT-4 is more likely to leak privacy than GPT-3.5 given the constructed prompts under different privacy-related words and events, potentially because it follows the privacy-leakage guiding instructions more precisely.

  5. Machine ethics: GPT-4 is easier to manipulate than GPT-3.5 by jailbreaking prompts to misrecognize immoral actions. It is also more vulnerable to evasive sentences that make immoral actions seem accidental, harmless or imaginary.

In summary, the paper finds that in adversarial settings with malicious prompts or demonstrations, GPT-4 is often more vulnerable than GPT-3.5, likely because it follows the misleading instructions more precisely due to its stronger instruction-following capabilities.

crizCraig commented 3 months ago

Overall it seems like GPT-4's ability to follow instructions is why it is more vulnerable to targeted attacks. Perhaps there should be an AI safety evaluation dimension of "misuse" where the model is more prone to misuse by bad actors.