allenai / comet-atomic-2020


Can't reproduce results for BART #9

Closed. puraminy closed this issue 2 years ago.

puraminy commented 3 years ago

I downloaded the pre-trained COMET-BART model and executed run.sh without --do-train, pointing --model_name_or_path at the downloaded checkpoint. However, the test ROUGE-L is not what you reported in the paper.

I downloaded the test files from https://storage.googleapis.com/ai2-mosaic-public/projects/mosaic-kgs/data_atomic_2020_BART-format.tgz, as mentioned in another issue. This is the content of metrics.json:

            "test_avg_loss": 4.657896041870117,
            "test_avg_rouge1": 0.2141809276565792,
            "test_avg_rouge2": 0.03116943751406806,
            "test_avg_rougeL": 0.21284768699554912,
            "test_avg_gen_time": 0.003392871381747925,
            "test_avg_summ_len": 7.352367184676545,
            "avg_rouge1": 0.2141809276565792,
            "step_count": 1
puraminy commented 3 years ago

I also tried running system_eval/automatic_eval.py on the generated results (results/test_generations.txt). I did some preprocessing to match the data to the input format of the topk_eval function (a rough sketch of that preprocessing is below the numbers). These are the results, which are again not close to those in the published paper:


{'source': '../test.source', 'target': '../test.target', 'gens': '../test_generations.txt'}
Bleu_1 : 0.2162225618386887
Bleu_2 : 0.10686090408092255
Bleu_3 : 0.06259034450030353
Bleu_4 : 0.0401598764267917
METEOR : 0.16162500982704825
ROUGE_L : 0.12477405964221096
CIDEr : 0.5586863475849368
Bert Score : 0.6151848637200189
Exact_match : 0.0
Records : 50880
TopK : 1
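
Roughly, the preprocessing was along these lines (a sketch only; the file names mirror the dict above, and the exact input format that topk_eval expects may differ):

    from collections import defaultdict

    # Minimal sketch of the preprocessing: read the line-aligned files and
    # group every gold tail and every generation by its source prompt, since
    # one (head, relation) prompt appears on several lines of test.source.
    def load_lines(path):
        with open(path, encoding="utf-8") as f:
            return [line.strip() for line in f]

    sources = load_lines("../test.source")
    targets = load_lines("../test.target")
    gens = load_lines("../test_generations.txt")

    grouped = defaultdict(lambda: {"references": [], "generations": []})
    for src, tgt, gen in zip(sources, targets, gens):
        grouped[src]["references"].append(tgt)
        grouped[src]["generations"].append(gen)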

In another issue, I also requested you to share your results (for BART and others) so that I can check the evaluation method.

wanicca commented 3 years ago

Hi @puraminy, I wonder whether you and the authors used multiple references during evaluation, since there can be more than one ground-truth tail for a given (subject, predicate) pair. I think this setting could have a significant impact on the scores (except perhaps BERTScore, I'd guess).

Though the downloadable datasets are organized as separate triples, the evaluation scripts seem to be designed for multiple references, which confuses me.
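
As a toy illustration of the effect (made-up strings, scored with NLTK's multi-reference BLEU; this is not the project's evaluation code):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # The same generation scored against a single gold tail vs. against all
    # gold tails for one (subject, predicate) pair.
    refs = [
        "to buy some groceries".split(),
        "to get food for dinner".split(),
        "to pick up milk".split(),
    ]
    hyp = "to get some food".split()
    smooth = SmoothingFunction().method1

    print("single reference:", sentence_bleu([refs[0]], hyp, smoothing_function=smooth))
    print("all references:  ", sentence_bleu(refs, hyp, smoothing_function=smooth))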

puraminy commented 3 years ago

@wanicca Thanks, yeah, I think so. I wrote an evaluation script that uses all ground-truth targets (sketched below), and the metrics are higher. Do you work on this project too? In general, I think the dataset is sparse! When I review the heads, targets, and relations, I often can't tell what they mean or whether they are only responses; some need more context.
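
Roughly, the idea was a best match over all gold tails per prompt, then an average over prompts (a sketch using the rouge_score package; not necessarily the paper's exact protocol):

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    # Best ROUGE-L F1 of a generation against any of the gold tails for its
    # prompt; averaging this over all prompts gives a multi-reference score.
    def best_rouge_l(generation, references):
        return max(scorer.score(ref, generation)["rougeL"].fmeasure for ref in references)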

wanicca commented 3 years ago

> @wanicca Thanks, yeah, I think so. I wrote an evaluation script that uses all ground-truth targets, and the metrics are higher. Do you work on this project too? In general, I think the dataset is sparse! When I review the heads, targets, and relations, I often can't tell what they mean or whether they are only responses; some need more context.

Yeah, I have recently been interested in commonsense knowledge and AI2's work. Indeed, I feel the same as you to some extent. I think it is a characteristic (or, put another way, a limitation) of current commonsense knowledge resources that whenever you deal with these tuples, you have to tell yourself there is a hidden annotation in front of them: "(In some situations/cases...)". We need some imagination to understand them.

puraminy commented 3 years ago

@wanicca Well, I have done some projects on them, and I want to try them in other languages. Anyway, could we be in contact? If so, I use Telegram, Slack, Microsoft Teams, ... whatever you are comfortable with; share an ID and we can talk there.

Aunsiels commented 3 years ago

I was also not able to reproduce the results. Would it be possible to publish a script that reproduces the results presented in the paper (training + generation + evaluation)?

csbhagav commented 3 years ago

Sorry for the delay in addressing the issues. We are looking at this and will respond soon.

keisks commented 2 years ago

Hi @puraminy @wanicca @Aunsiels @csbhagav, I'm very sorry for the delayed update.

I uploaded the model from the original AAAI submission: https://storage.googleapis.com/ai2-mosaic-public/projects/mosaic-kgs/comet-atomic_2020_BART_aaai.tar.gz

and here is an example of its usage: https://github.com/allenai/comet-atomic-2020/blob/aaai2021/models/comet_atomic2020_bart/generation_example.py#L112
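
For a quick sanity check, loading the extracted checkpoint with Hugging Face transformers and generating should look roughly like this (a minimal sketch; the local path is a placeholder, and the query format and decoding settings only loosely follow generation_example.py):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Placeholder path: wherever the tarball above was extracted.
    model_path = "./comet-atomic_2020_BART_aaai"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

    # Queries are "{head} {relation} [GEN]", as in generation_example.py.
    query = "PersonX goes to the mall xIntent [GEN]"
    inputs = tokenizer([query], return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=5, num_return_sequences=5)
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))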

For evaluation, please see https://github.com/allenai/comet-atomic-2020/tree/master/system_eval#running-evaluation

(The backstory is that we re-trained our BART model in a slightly different format for demo purposes (after the AAAI submission), and released only the re-trained version. Now, both models are available.)

Thank you for your patience, and if you have any other questions, feel free to ask!