Closed dschaehi closed 2 days ago
Hi @AndreasMadsen, thanks for the interesting work! Since the evaluation of your work only includes models from last year, I am curious whether the conclusions are valid for recent strong models from OpenAI and Anthropic, so that one can still use your paper as a reference. Do you have an idea? Thanks!
We didn't check those because the results can't be reproduced, and their license doesn't allow criticism. They may perform better on some tasks, but there is nothing to suggest that they provide generally faithful explanations. These models are, to the best of our knowledge, optimized for human preference, which is at minimum orthogonal to faithfulness.
Thanks for your answer! I still think it would be interesting to know to what extent recent models provide more faithful explanations, even if 100% faithfulness cannot be attained with them. One might ask, for example, whether Claude Sonnet provides more faithful explanations than GPT-4o, and to what extent.
> their license doesn't allow criticism.
This is interesting 🤔
> whether Claude Sonnet provides more faithful explanations than GPT-4o, and to what extent.
Agreed. As mentioned in our limitations section, we are only able to give a binary classification of faithfulness and then measure its frequency. Indeed, it would be more useful to measure the degree of faithfulness. This idea is generally noted by Jacovi and Goldberg [1].
[1] Jacovi, A., & Goldberg, Y. (2020). Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness? Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4198–4205. https://doi.org/10.18653/v1/2020.acl-main.386
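To make the distinction concrete, here is a minimal sketch of frequency-of-binary-faithfulness versus a graded degree of faithfulness; the verdicts and scores below are made up for illustration, not data from the paper:

```python
# Hypothetical illustration -- not data or code from the paper.
# Binary verdicts: 1 = explanation judged faithful, 0 = unfaithful.
binary_verdicts = [1, 0, 1, 1, 0]

# Graded scores: a hypothetical degree of faithfulness in [0, 1].
graded_scores = [0.9, 0.2, 0.7, 1.0, 0.4]

# What a binary evaluation reports: the frequency of faithful explanations.
frequency = sum(binary_verdicts) / len(binary_verdicts)

# What a graded evaluation could report instead: the mean degree of faithfulness.
mean_degree = sum(graded_scores) / len(graded_scores)

print(f"frequency of faithful explanations: {frequency:.2f}")  # 0.60
print(f"mean degree of faithfulness: {mean_degree:.2f}")       # 0.64
```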
> their license doesn't allow criticism.
>
> This is interesting 🤔
FYI, the following is how ChatGPT replied. I don't know how faithful the reply is, though...
No, the claim as presented is misleading and based on a misunderstanding of how license agreements, indemnity clauses, and liability work for users of AI models like GPT-4. Let’s break it down:
- License Agreements and Indemnity Clauses
GPT-4, as provided by OpenAI, is typically accessed under a terms of service (ToS) or license agreement. These agreements usually contain:
  - Indemnity clauses: These often require users to indemnify the company (e.g., OpenAI) if the user’s improper or unauthorized use of the model causes legal issues.
  - However, such clauses do not make users financially liable simply for publishing research, critique, or analysis of the AI model. Liability generally applies if the user breaches the terms (e.g., using the model for prohibited purposes like harmful activities).
- Faithfulness of Explanations
Faithfulness (i.e., how well the explanations provided by the model align with its underlying reasoning) is an ongoing area of research. Critiquing or demonstrating the model’s limitations in this regard is part of normal scientific inquiry and does not automatically expose the researcher to liability.
- Risk of Financial Liability
There is no known legal precedent where a researcher was held financially liable for merely analyzing or critiquing the explanations or outputs of an AI model.
  - Companies behind these products (e.g., OpenAI) are unlikely to pursue legal action against researchers for objective, good-faith academic work, even if it points out flaws in their models.
  - Most indemnity clauses in ToS are designed to protect the company from malicious or illegal activities performed using their service, not from independent academic research.
- Conflict of Interest
The suggestion that researchers face a conflict of interest due to potential liability is speculative and not supported by any legal norm or common industry practice. In fact, many companies encourage external research and feedback.
Conclusion
The claim appears to misunderstand the intent and scope of indemnity clauses in AI model licenses. Researchers are generally not at personal financial risk for conducting unbiased evaluations or publishing results on AI models like GPT-4. However, it’s always a good idea for researchers to:
- Read the terms of service of the AI model they are using.
- Consult legal counsel if they have specific concerns, particularly if their research involves sensitive or high-stakes scenarios.
It's my decision what I'm comfortable doing in terms of OpenAI's ToS. If you wish to repeat our analysis on OpenAI and other proprietary AIs, then I would definitely be interested in seeing such a publication.
This is what OpenAI's ToS says:
> If you are a business or organization, to the extent permitted by law, you will indemnify and hold harmless us, our affiliates, and our personnel, from and against any costs, losses, liabilities, and expenses (including attorneys’ fees) from third party claims arising out of or relating to your use of the Services and Content or any violation of these Terms.
You can compare this with Llama 2, which has the much narrower scope of only indemnifying IP claims:
> c. If you institute litigation or other proceedings against Meta or any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Llama 2 outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by you, then any licenses granted to you under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Meta from and against any claim by any third party arising out of or related to your use or distribution of the Llama Materials.
I'm not interested in giving any legal opinion on this. "license doesn't allow criticism" is definitely an oversimplification and just briefly reflects my personal decision. As with most chat-based responses, due to sycophantic behavior, the answer you get will depend on how you ask the question.