gblazex opened this issue 1 year ago (status: Open)
I also commented on this. Even without reading any of the text, you can predict which answer is GPT-3.5's by simply selecting the shorter string. Research on persuasion finds that longer arguments tend to be seen as more persuasive and of higher quality. The ideal test would compare two generations of similar length.
"I should add that! Right now I tell the models to write as concisely as possible, max 1 paragraph"
There are too many moving variables. In effect we aren't just testing answer quality, we are also testing instruction following (be concise), and GPT-3.5 follows that instruction better. But then people select the longer (more comprehensive) answer.
LLaMA wins in this case simply by following the instructions less closely.
Maybe ask each model to answer in 3 bullet points or something similar. I imagine that would be easier for each model to follow in a similar fashion.
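A rough sketch of how the length concern above could be checked, assuming you have the answer pairs as plain strings. The function names, the word-count measure, and the 1.25 ratio threshold are all my own placeholders, not anything the site actually uses:

```python
from typing import List, Tuple


def word_count(text: str) -> int:
    """Count whitespace-separated words (a crude length measure)."""
    return len(text.split())


def shorter_guess_accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of pairs where the GPT-3.5 answer is strictly shorter.

    Each pair is (gpt35_answer, other_answer). A value near 1.0 means
    "pick the shorter string" identifies GPT-3.5 almost every time,
    so length is confounded with model identity.
    """
    if not pairs:
        return 0.0
    hits = sum(1 for gpt, other in pairs if word_count(gpt) < word_count(other))
    return hits / len(pairs)


def similar_length(a: str, b: str, max_ratio: float = 1.25) -> bool:
    """True if the longer answer is at most max_ratio times the shorter one.

    max_ratio is an arbitrary, hypothetical threshold; tune it as needed.
    """
    la, lb = word_count(a), word_count(b)
    if min(la, lb) == 0:
        return False
    return max(la, lb) / min(la, lb) <= max_ratio
```

If `shorter_guess_accuracy` comes out high, one option is to only show voters the pairs that pass `similar_length`, so the comparison is closer to answer quality than to answer length.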
E.g., for the prompt "How is the climate change affecting our planet's biodiversity?"
The boxing site shows this as the gpt-3.5-turbo answer:
But if I run the same prompt in the API playground (default settings, except a higher maximum token length), I get a much more comprehensive answer:
https://platform.openai.com/playground/p/o9a5CO3QS6koRDN6CIAiY8Jj?model=gpt-3.5-turbo