aehrc / cvt2distilgpt2

Improving Chest X-Ray Report Generation by Leveraging Warm-Starting
GNU General Public License v3.0

When reproducing the results, it is possible to reproduce the NLG results but not the CE metric results. #9

Open mengweiwang opened 1 year ago

mengweiwang commented 1 year ago

I used the same data format as R2Gen and R2GenCMN (Chen et al.), which this article also follows (findings section only), but I was unable to reproduce the CE metric results reported in the paper.

I used the provided epoch=8-val_chen_cider=0.425092.ckpt model for the cvt_21_to_distilgpt2 task and also tested the epoch=0-val_chen_cider=0.410965.ckpt model for the cvt_21_to_distilgpt2_scst task, but neither achieved the CE metric results reported in the paper.

For the CE metrics, precision_macro reaches the value reported in the paper, but recall_macro and f1_macro do not, and the difference is significant.
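As a quick aside on why such a gap is possible (an illustration with hypothetical numbers, not this thread's data): macro averaging weights every label equally, so a few rare labels with near-zero recall can pull recall_macro far below recall_micro.

```python
# Toy illustration with hypothetical per-label recalls: macro averaging
# weights every label equally, so rare labels with near-zero recall
# dominate the mean.
import numpy as np

per_label_recall = np.array([0.8, 0.7, 0.6] + [0.0] * 11)  # 14 labels
print(f"recall_macro: {per_label_recall.mean():.4f}")  # recall_macro: 0.1500
```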

When calculating the CE metrics here, only the findings text is considered; do I need to perform any other processing?
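For context, here is a minimal sketch of how CE metrics of this kind are typically computed, assuming 14 CheXbert-style binary labels per report and sklearn averaging; this is an illustration, not the repository's exact evaluation code:

```python
# Minimal CE-metric sketch: label the reference and generated findings with a
# labeler such as CheXbert, then compare the binary label matrices.
# Hypothetical placeholder data stands in for the real labeler output.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

num_reports, num_labels = 3858, 14
y_true = np.random.randint(0, 2, size=(num_reports, num_labels))  # placeholder
y_pred = np.random.randint(0, 2, size=(num_reports, num_labels))  # placeholder

for average in ("macro", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=average, zero_division=0
    )
    print(f"test_ce_precision_{average}: {p:.4f}")
    print(f"test_ce_recall_{average}: {r:.4f}")
    print(f"test_ce_f1_{average}: {f1:.4f}")
```

If the labeler, label set, or averaging differs from what the repository uses, the macro scores can shift noticeably, which is one place a reproduction can diverge.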

The results obtained from the cvt_21_to_distilgpt2 and cvt_21_to_distilgpt2_scst tasks, alongside the CE results reported in the paper, are as follows (values to four decimal places):

| Metric | cvt_21_to_distilgpt2 | cvt_21_to_distilgpt2_scst | Paper |
| --- | --- | --- | --- |
| test_ce_precision_macro | 0.3600 | 0.3873 | 0.3597 |
| test_ce_recall_macro | 0.2542 | 0.2558 | 0.4122 |
| test_ce_f1_macro | 0.2594 | 0.2636 | 0.3842 |
| test_ce_precision_micro | 0.4919 | 0.4962 | – |
| test_ce_recall_micro | 0.3993 | 0.3969 | – |
| test_ce_f1_micro | 0.4408 | 0.4411 | – |
| test_ce_precision_example | 0.4172 | 0.4175 | – |
| test_ce_recall_example | 0.3666 | 0.3644 | – |
| test_ce_f1_example | 0.3660 | 0.3648 | – |
| test_ce_num_examples | 3858 | 3858 | – |
| test_chen_bleu_1 | 0.3929 | 0.3947 | – |
| test_chen_bleu_2 | 0.2481 | 0.2488 | – |
| test_chen_bleu_3 | 0.1716 | 0.1718 | – |
| test_chen_bleu_4 | 0.1270 | 0.1270 | – |
| test_chen_meteor | 0.1546 | 0.1550 | – |
| test_chen_rouge | 0.2866 | 0.2876 | – |
| test_chen_cider | 0.3903 | 0.3799 | – |
| test_chen_num_examples | 3858 | 3858 | – |

I reproduced the above results after modifying only the task parameters in task/mimic_cxr_jpg_chen/jobs.yaml.
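For anyone following along, the kind of change involved is sketched below. The key names are assumptions for illustration; inspect task/mimic_cxr_jpg_chen/jobs.yaml for the real structure.

```yaml
# Illustrative sketch of task/mimic_cxr_jpg_chen/jobs.yaml; key names are
# assumptions, not copied from the repository. The point is that only the
# task selection differs between the two runs.
task: cvt_21_to_distilgpt2   # switched to cvt_21_to_distilgpt2_scst for the second run
```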

anicolson commented 1 year ago

Hi,

Please see the updated README.md for the labels from Chen et al. https://github.com/aehrc/cvt2distilgpt2

I will look into the discrepancy with the results.

mengweiwang commented 1 year ago

> Hi,
>
> Please see the updated README.md for the labels from Chen et al. aehrc/cvt2distilgpt2
>
> I will look into the discrepancy with the results.

@anicolson

Yes, I am using this dataset, and the precision matches the level reported in the paper. However, the recall is low and does not reach the reported level.

Also, I have checked and tested the updated source code. The CE metric results did not change much, and there is a bug in the latest source code when running it. The bug is as follows:

The bug occurred while I was executing the cvt_21_to_distilgpt2 task.

Line 281 of transmodal/model.py reads `if not getattr(self, metric).compute_on_step:`.

Running it raises an AttributeError because the compute_on_step attribute does not exist.
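For reference, compute_on_step was deprecated in torchmetrics v0.8 and removed in a later release, so this line fails under newer torchmetrics versions. A minimal defensive rewrite, offered as a sketch rather than the repository's official fix:

```python
# Guard for line 281 of transmodal/model.py, assuming the AttributeError comes
# from a torchmetrics version in which Metric.compute_on_step was removed.
# Defaulting to False mirrors the old "accumulate now, compute at epoch end"
# behaviour when the attribute is absent.
if not getattr(getattr(self, metric), "compute_on_step", False):
    ...  # epoch-end computation path
```

Alternatively, pinning torchmetrics to the version the repository was developed against avoids touching the code.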

anicolson commented 1 year ago

Hi, there are some errors in the preprint; the correct results are reported in the updated repository.