bird-bench / mini_dev


Hard to reproduce the metrics of VES and R-VES. #11

Open JimXiongGM opened 2 months ago

JimXiongGM commented 2 months ago

The Valid Efficiency Score (VES) and the Reward-based Valid Efficiency Score (R-VES) are calculated based on the execution time of SQL queries, which means the results depend on the computer hardware. Therefore, it is recommended to release the submitters' DEV prediction files on the official leaderboard so that researchers can make fair comparisons on their own machines. Thank you.

bird-bench commented 2 months ago

@JimXiongGM Thanks for your interest in our work. That makes sense; we have already added this to the submission guidelines as a required file. For previous submissions, you will still need to request the dev files from the submitters. Although execution time does depend on hardware, we already mitigate this by running each query 100 times, reducing outliers, and setting a ceiling for each. Still, this is a nice suggestion. Thanks!
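For readers who want to compare prediction files on their own machine, here is a rough sketch of the mitigation described above: repeated timing, outlier removal, and a ceiling. It is not the official BIRD evaluation code. The helper names, the 3-standard-deviation outlier rule, and the ceiling value of 2.0 are all assumptions made for illustration; only the 100 repetitions come from the reply above.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, repeats=100):
    """Return wall-clock durations (seconds) for `repeats` executions of `sql`."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        durations.append(time.perf_counter() - start)
    return durations


def drop_outliers(values, k=3.0):
    """Keep values within k standard deviations of the mean (assumed rule)."""
    if len(values) < 2:
        return list(values)
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= k * std]


def capped_time_ratio(gold_times, pred_times, ceiling=2.0):
    """Mean gold time divided by mean predicted time, capped at `ceiling` (assumed value)."""
    gold = statistics.mean(drop_outliers(gold_times))
    pred = statistics.mean(drop_outliers(pred_times))
    if pred == 0:
        return ceiling
    return min(gold / pred, ceiling)


if __name__ == "__main__":
    # Toy in-memory database; a real comparison would open a BIRD dev database file.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
    gold = time_query(conn, "SELECT COUNT(*) FROM t WHERE x % 2 = 0")
    pred = time_query(conn, "SELECT SUM(x % 2 = 0) FROM t")
    print(capped_time_ratio(gold, pred))
```

Because both queries are timed on the same machine in the same session, the ratio is less sensitive to absolute hardware speed than raw execution times are, though it can still vary between runs.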