boson-ai / RPBench-Auto

An automated pipeline for evaluating LLMs for role-playing.
Apache License 2.0

Impact of Message Order and Role on LLM API Response Quality #7

Open yang-su2000 opened 3 weeks ago

yang-su2000 commented 3 weeks ago

Great work on open-sourcing the benchmark!

In the code snippet provided in eval_models_pairwise(...), the message order sent to the LLM API is as follows:
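(Roughly this shape; I'm sketching it with the OpenAI Python client and placeholder content, so the variable names, strings, and model name below are illustrative rather than the repo's actual code.)

```python
from openai import OpenAI  # any OpenAI-compatible chat-completions client

client = OpenAI()

# Illustrative reconstruction of the ordering I mean: the system prompt comes
# first, the character's greeting is sent as the first *assistant* turn, and
# the first *user* turn only appears after it.
messages = [
    {"role": "system", "content": "You are Alice, a cheerful tavern keeper. <role-play instructions>"},
    {"role": "assistant", "content": "Welcome, traveler! What brings you to my tavern?"},  # greeting
    {"role": "user", "content": "I'm looking for a room for the night."},
]

response = client.chat.completions.create(model="your-model", messages=messages)  # placeholder model name
print(response.choices[0].message.content)
```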

I am concerned that many LLMs are trained to handle user input first and may expect a user message before the assistant's greeting. In my testing, this ordering can affect the model's response or introduce unintended behavior.

In particular, I have observed that changing the system message to the user role and putting the "background" description in the system message often improves response quality for some LLMs (see the sketch below). Could you check whether this message order makes more sense, or whether there is a recommended approach for structuring messages to ensure optimal model performance?
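Concretely, this is the kind of variant I have been experimenting with (again with illustrative placeholder content, not a patch against the repo):

```python
# Proposed variant: the "background" description becomes the system message,
# and the content that was previously the system message is sent with the
# user role, so the model sees a user turn before it has to respond.
messages = [
    {"role": "system", "content": "<background description of the scene and characters>"},
    {"role": "user", "content": "<role-play instructions that were previously the system message>"},
    {"role": "assistant", "content": "Welcome, traveler! What brings you to my tavern?"},  # greeting
    {"role": "user", "content": "I'm looking for a room for the night."},
]
# Sent exactly as before, e.g.:
# client.chat.completions.create(model="your-model", messages=messages)
```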

Appreciate any feedback and suggestions!

sxjscience commented 3 weeks ago

Typically, users set the greeting message to control the bot's response style. Specifying it as the first assistant message in the API request or including it in the system message both seem reasonable to me. Ideally, a well-trained model should provide similar responses for both prompting techniques.
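For example, both of the following (illustrative placeholders, not the benchmark's exact prompts) should ideally elicit similar behavior from a well-trained model:

```python
# Variant A: greeting supplied as the first assistant message.
variant_a = [
    {"role": "system", "content": "You are Alice, a cheerful tavern keeper."},
    {"role": "assistant", "content": "Welcome, traveler! What brings you to my tavern?"},
    {"role": "user", "content": "I'm looking for a room for the night."},
]

# Variant B: greeting folded into the system message instead.
variant_b = [
    {
        "role": "system",
        "content": (
            "You are Alice, a cheerful tavern keeper.\n"
            "Open the conversation with: 'Welcome, traveler! What brings you to my tavern?'"
        ),
    },
    {"role": "user", "content": "I'm looking for a room for the night."},
]
```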