XiangLi1999 / PrefixTuning

Prefix-Tuning: Optimizing Continuous Prompts for Generation

How to evaluate DART? Has the test set changed? #12

Closed JinliangLu96 closed 3 years ago

JinliangLu96 commented 3 years ago

Hi, Lisa! I read your paper and you have done brilliant work. I want to fine-tune GPT on the DART dataset, but I don't know how to evaluate my results. The official scripts (https://github.com/Yale-LILY) use a different test set (5,097 samples), which also has different references. I used your test set (12,552 samples) for generation and evaluated against the target sentences in that test set (the 12,552 samples are aligned, so each sample has only one reference). However, I can only get a BLEU of about 26.28 (GPT large), much lower than yours. Could you please tell me how to evaluate it? Thank you!
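(For context: DART groups several references under one source, so scoring each prediction against a single aligned target tends to understate BLEU. Below is a minimal sketch of multi-reference scoring with NLTK, not the official DART scorer; the file names and the tab-separated source/target format are hypothetical placeholders.)

```python
# Sketch: multi-reference BLEU by grouping aligned targets per source.
# Assumes a hypothetical "source\ttarget" TSV aligned line-by-line with
# a predictions file; this is NOT the official DART evaluation script.
from collections import defaultdict
from nltk.translate.bleu_score import corpus_bleu

refs_by_src = defaultdict(list)
hyp_by_src = {}
with open("test_aligned.tsv") as tf, open("predictions.txt") as pf:
    for line, hyp in zip(tf, pf):
        src, tgt = line.rstrip("\n").split("\t")
        refs_by_src[src].append(tgt.lower().split())   # collect all refs per source
        hyp_by_src[src] = hyp.strip().lower().split()  # keep one hypothesis per source

sources = list(hyp_by_src)
references = [refs_by_src[s] for s in sources]  # several references per hypothesis
hypotheses = [hyp_by_src[s] for s in sources]
print("multi-ref BLEU:", corpus_bleu(references, hypotheses))  # score in [0, 1]
```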

XiangLi1999 commented 3 years ago

emm, they probably released a new version of DART.

To replicate my evaluation: the test scripts I used are here: https://worksheets.codalab.org/bundles/0x8998987a1ebe4a6dab0c7ff13122fd49, and the command line is `python dart-eval2/tolower.py curr_dir ; cd dart-eval2; ./run_eval_on_dart.sh ~/curr_dir.lt2` (where `curr_dir` is the file that contains the prediction output).
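(For anyone parsing that one-liner, here it is spelled out, assuming the CodaLab bundle unpacks to a `dart-eval2/` directory and that `tolower.py` writes the lowercased predictions as `curr_dir.lt2`; the layout is inferred from the command above, not verified against the bundle.)

```bash
# Assumed expansion of the one-liner above.
python dart-eval2/tolower.py curr_dir   # lowercase the prediction file (writes curr_dir.lt2)
cd dart-eval2
./run_eval_on_dart.sh ~/curr_dir.lt2    # run the bundled DART evaluation script
```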

JinliangLu96 commented 3 years ago

Thank you for your answer, I will try it. Thank you very much!!!

TravisL24 commented 7 months ago

Hi! First of all, thank you both very much for your work. I had the same problem when replicating the experiment on the DART dataset using GPT-2, but I can't open the link provided by Lisa (it may have expired after all this time). Can you give me some help? Looking forward to your reply.