andyzoujm / representation-engineering

Representation Engineering: A Top-Down Approach to AI Transparency
https://www.ai-transparency.org/
MIT License
734 stars, 88 forks

The confusion about how to generate dishonest responses #6

Closed (caesar-jojo closed 1 year ago)

caesar-jojo commented 1 year ago

Hello, author, how can I generate dishonest responses?

andyzoujm commented 1 year ago

Hi, the simplest way to generate dishonest responses (though probably not the best-performing one) is to use a negative coefficient in the control portion of the honesty.ipynb notebook.
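To illustrate why flipping the sign of the coefficient pushes generation the other way, here is a toy sketch. It is not the notebook's code: it assumes the control step adds `coeff * direction` to a hidden state, and the helper names `dot` and `steer`, the vector `honesty_direction`, and all numeric values are made up for illustration.

```python
# Toy illustration only: plain Python lists stand in for model activations.
# Assumption: control adds coeff * direction to the hidden state.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def steer(hidden, direction, coeff):
    """Return hidden + coeff * direction (hypothetical helper)."""
    return [h + coeff * d for h, d in zip(hidden, direction)]

# Made-up unit-length "honesty" direction and an arbitrary hidden state.
honesty_direction = [0.6, 0.8, 0.0]
hidden_state = [1.0, 2.0, 3.0]

# Projection onto the direction before and after steering.
baseline = dot(hidden_state, honesty_direction)
more_honest = dot(steer(hidden_state, honesty_direction, 2.0), honesty_direction)
less_honest = dot(steer(hidden_state, honesty_direction, -2.0), honesty_direction)

print(baseline, more_honest, less_honest)
```

With a positive coefficient the projection onto the direction grows, and with a negative one it shrinks, which is the effect the negative coefficient is meant to have on the model's activations. The magnitude of the coefficient that works in practice depends on the model and layers, so treat the values here as placeholders.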

justinphan3110 commented 1 year ago

@caesar-jojo Hi, we have just added an example, honesty_mistral, that generates dishonest responses with mistralai/Mistral-7B-Instruct-v0.1.