OpenLMLab / LEval

[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark

Validation / test split #4

Closed · howard50b closed this issue 1 year ago

howard50b commented 1 year ago

Hi, thanks for the timely resources! I have a question regarding the dataset splits: I noticed that the dataset seems to only have a test set. Is that by design? If so, how do you prevent overfitting (even if it's just tuning the prompts), or is there a plan to add a validation set?
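(For reference, a minimal sketch of how one might check which splits a task exposes; the Hub path `L4NLP/LEval` and the task name `coursera` are assumptions based on the project's usage examples and may need adjusting.)

```python
# Minimal sketch: inspect the splits of one L-Eval task with the `datasets` library.
# The Hub path "L4NLP/LEval" and the task name "coursera" are assumptions and may differ.
from datasets import load_dataset

data = load_dataset("L4NLP/LEval", "coursera")
print(data)  # at the time of this issue, only a "test" split is listed
```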

Thank you!

ChenxinAn-fdu commented 1 year ago

Hi!!! Thank you for reminding us of the overfitting issue. Since we have already released the ground truth of the test set 😫, adding a validation set would not prevent developers from tuning prompts on the test set if they find the performance unsatisfactory. To mitigate overfitting, we recommend that developers also report their results using our prompt, which is nearly identical across all baselines, or compare their baseline systems using the same prompt they have designed.
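For developers following that recommendation, driving every system through one shared prompt template might look like the sketch below. The template text, the `L4NLP/LEval` Hub path, and the `input`/`instructions`/`outputs` field names are assumptions based on the released data format, not the repository's official evaluation code.

```python
from datasets import load_dataset

# Hypothetical shared template; the actual L-Eval prompts live in the repository.
PROMPT_TEMPLATE = "{document}\n\nQuestion: {instruction}\nAnswer:"

def iter_prompts(task_name: str):
    """Yield (prompt, ground_truth) pairs for one task, built with a single fixed template."""
    # Hub path is an assumption based on the project's usage examples.
    data = load_dataset("L4NLP/LEval", task_name, split="test")
    for sample in data:
        # Each sample pairs one long document ("input") with several question/answer pairs.
        for instruction, ground_truth in zip(sample["instructions"], sample["outputs"]):
            prompt = PROMPT_TEMPLATE.format(document=sample["input"], instruction=instruction)
            yield prompt, ground_truth
```

Keeping the template fixed across all systems being compared is the point: any gain then comes from the model itself rather than from prompt tuning against the released test answers.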

howard50b commented 1 year ago

Got it. Thanks for the reply!