Closed JinliangLu96 closed 3 years ago
Hmm, they probably released a new version of DART.
To replicate my evaluation, the test scripts I used are here: https://worksheets.codalab.org/bundles/0x8998987a1ebe4a6dab0c7ff13122fd49
The command line is:
python dart-eval2/tolower.py curr_dir ; cd dart-eval2; ./run_eval_on_dart.sh ~/curr_dir.lt2
where curr_dir is the filename that contains the prediction output.
Thank you for your answer, I will try it. Thank you so much!
Hi! First of all, thank you both very much for your work. I ran into the same problem when replicating the experiment on the DART dataset using GPT-2, but I can't open the link Lisa provided (perhaps because it has been a long time). Could you give me some help? Looking forward to your reply.
Hi, Lisa! I read your paper and you have done brilliant work. I want to fine-tune GPT-2 on the DART dataset. However, I don't know how to evaluate my results. The official scripts (https://github.com/Yale-LILY) provide a different test set (5,097 samples), which has different references, too. I used your test set (12,552 samples) for generation and evaluated against the target sentences in that test set (the 12,552 samples are aligned, so each sample has only one reference). However, I can only get a BLEU of about 26.28 (GPT-2 large), much lower than yours. Could you please explain how to evaluate it? Thank you!
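One likely source of the gap described above: scoring each of the 12,552 aligned pairs against a single reference underestimates BLEU, whereas the official 5,097-sample test set keeps multiple references per source. A hedged sketch, assuming the flattened file repeats each source once per reference: grouping the flattened pairs back by source recovers a multi-reference setup before scoring (e.g. with sacreBLEU's multi-reference mode). The names and toy data below are illustrative, not from either repository.

```python
# Hedged sketch: rebuild multi-reference groups from a flattened,
# single-reference prediction/target file. Assumption: the 12,552-line
# file pairs each source with one reference, and sources repeat, so
# grouping by source should yield ~5,097 multi-reference entries.
from collections import defaultdict

def group_references(pairs):
    """Collect all references that share the same source input."""
    grouped = defaultdict(list)
    for source, reference in pairs:
        grouped[source].append(reference)
    return dict(grouped)

# Toy example: three flattened lines covering two distinct sources.
pairs = [
    ("triple_A", "ref one"),
    ("triple_A", "ref two"),
    ("triple_B", "ref three"),
]
grouped = group_references(pairs)
assert len(grouped) == 2
assert grouped["triple_A"] == ["ref one", "ref two"]
```

Each grouped entry can then be scored with a multi-reference BLEU implementation instead of one reference per line, which should bring the number closer to the officially reported scores.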