Open lucaordronneau opened 3 months ago
@lucaordronneau Thanks for your interest in our work. Yes, EX is stricter: we consider the order of the returned results to be part of the user's requirements. Imagine an agent returning a long list in a messy order, which would be very annoying for users. However, if you want to disregard order, you can try our new metric, Soft-F1, which is currently in beta testing and is described in detail here: soft-f1. Thanks.
Hello,
I found a possible improvement in the evaluation process for the BIRD dataset. The prediction below is marked as incorrect by the evaluation method, but the only difference is the order of the elements.
The evaluation method uses strict equality. This occurs in the file bird/llm/src/evaluation.py on line 26.
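To illustrate the difference, here is a minimal sketch of a strict comparison next to an order-insensitive one. It assumes query results arrive as lists of row tuples; the function names `strict_match` and `unordered_match` are hypothetical and are not taken from evaluation.py.

```python
from collections import Counter

def strict_match(predicted_rows, gold_rows):
    # Strict equality: same rows in the same order.
    return predicted_rows == gold_rows

def unordered_match(predicted_rows, gold_rows):
    # Order-insensitive: same rows with the same multiplicities,
    # regardless of the order in which they were returned.
    return Counter(predicted_rows) == Counter(gold_rows)

# Hypothetical results that differ only in row order.
predicted = [("Alice",), ("Bob",)]
gold = [("Bob",), ("Alice",)]

print(strict_match(predicted, gold))     # False
print(unordered_match(predicted, gold))  # True
```

Using `Counter` rather than `set` keeps duplicate rows significant, so a prediction that returns a row twice does not match a gold result that returns it once.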