Evaluation Method for BIRD Dataset [Enhancement]

AlibabaResearch / DAMO-ConvAI

DAMO-ConvAI: The official repository which contains the codebase for Alibaba DAMO Conversational AI.

MIT License


Evaluation Method for BIRD Dataset [Enhancement] #159

Open lucaordronneau opened 3 months ago

lucaordronneau commented 3 months ago

Hello,

I found a possible improvement to the evaluation process for the BIRD dataset. The prediction below is marked as incorrect by the evaluation method, even though it differs from the expected result only in the order of its elements.

The evaluation method uses strict equality. This occurs in the file bird/llm/src/evaluation.py on line 26.
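To illustrate the difference, here is a minimal sketch of strict versus order-insensitive comparison of query results. This is not the code from evaluation.py; the function names and the sample rows are hypothetical, and it assumes results are lists of hashable tuples, as database cursors typically return.

```python
from collections import Counter


def strict_match(predicted, expected):
    """Strict equality: rows and their order must be identical."""
    return predicted == expected


def row_order_insensitive_match(predicted, expected):
    """Ignore the order of rows, but keep duplicates significant."""
    return Counter(predicted) == Counter(expected)


def fully_order_insensitive_match(predicted, expected):
    """Ignore the order of rows and the order of values inside each row."""
    def normalize(rows):
        # Sort each row by repr so values of mixed types can be ordered.
        return Counter(tuple(sorted(row, key=repr)) for row in rows)
    return normalize(predicted) == normalize(expected)


# Hypothetical results: same content, different ordering.
expected = [("Bob", 5), ("Alice", 3)]
reordered_rows = [("Alice", 3), ("Bob", 5)]
reordered_columns = [(3, "Alice"), (5, "Bob")]
```

With these sample values, `strict_match(reordered_rows, expected)` fails, while `row_order_insensitive_match` accepts the same pair. Only `fully_order_insensitive_match` accepts `reordered_columns`. Which of these is appropriate depends on whether ordering is considered part of the question being asked.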

bird-bench commented 2 months ago

@lucaordronneau Thanks for your interest in our work. Yes, EX is strict by design: we consider the order of the returned results to be part of the user's requirements. Imagine an agent returning a long list in a messy order, which would be very annoying for users. However, if you do not want to take order into account, you can try our new metric, Soft-F1, which is currently in beta testing. It is described in detail here: soft-f1. Thanks.