CBLUEbenchmark / CBLUE

中文医疗信息处理基准CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark
https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us
Apache License 2.0

Were the baselines trained with dev set as well? #7

MattYoon closed this issue 2 years ago

MattYoon commented 2 years ago

Hi.

Were the baselines trained on both the train set and the dev set before testing, or were they trained on the train set only?

I used the exact hyper-parameters given in your paper and ran the baseline code to test hfl/chinese-bert-wwm-ext on CMeEE. The results are confusing: with train + dev I got 62.8, and with train only I got 60.7. The test score reported in your paper is 61.7.

Can you please tell me which way the baselines were tested? Thank you.

flow3rdown commented 2 years ago

Hi, @MattYoon: Our baselines were trained on the train set only. The results may vary with different hardware; you can adjust the hyper-parameters and train again.

MattYoon commented 2 years ago

Thank you for your fast response!

Did you use early stopping to obtain the test results? i.e. when it says a certain model was trained for 5 epochs, did you pick the best performing epoch based on the dev result?

flow3rdown commented 2 years ago

Yes, we set the number of training epochs in advance and select the best model based on the dev results. We don't use early stopping on the CMeEE, CMeIE, CDN, and CTC tasks.
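
For readers following along, here is a minimal sketch of one way to read that procedure: run a fixed number of epochs with no early exit, and keep whichever epoch scores best on the dev set. It is not taken from the CBLUE baseline code. The function `train_fixed_epochs`, the two callables, and the dev scores below are hypothetical placeholders for illustration only.

```python
from typing import Callable, Tuple


def train_fixed_epochs(
    num_epochs: int,
    train_one_epoch: Callable[[int], None],
    evaluate_on_dev: Callable[[int], float],
) -> Tuple[int, float]:
    """Run every epoch (no early stopping) and track the best dev score.

    Both callables are hypothetical stand-ins for a real training step
    and a real dev-set evaluation.
    """
    if num_epochs < 1:
        raise ValueError("num_epochs must be at least 1")

    best_epoch, best_score = 0, float("-inf")
    for epoch in range(1, num_epochs + 1):
        train_one_epoch(epoch)
        score = evaluate_on_dev(epoch)
        # Strict comparison: on a tie, the earlier epoch is kept.
        if score > best_score:
            best_epoch, best_score = epoch, score
            # A real run would save the model checkpoint here.
    return best_epoch, best_score


# Made-up dev scores, one per epoch, purely to exercise the function.
fake_dev_scores = {1: 0.55, 2: 0.61, 3: 0.63, 4: 0.62, 5: 0.60}

best_epoch, best_score = train_fixed_epochs(
    num_epochs=5,
    train_one_epoch=lambda epoch: None,  # no-op stand-in for training
    evaluate_on_dev=lambda epoch: fake_dev_scores[epoch],
)
print(best_epoch, best_score)
```

The loop never stops early, so all five epochs are always trained. Only the choice of which checkpoint to evaluate on the test set depends on the dev score.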

xxllp commented 2 years ago

May I ask whether the baseline scores were measured on the test set or on the dev set?