Open Leepoet opened 1 year ago
Hi, thanks for your interest in our work. Note that for radiology report generation, precision matters more than the diversity of the reports. To validate our method, we follow the most notable work in this area and use six widely-used evaluation metrics to gauge the performance of our model. We also observed the same phenomenon in our experiments with other models, e.g., R2Gen (see this issue) and R2GenCMN, on the IU-Xray dataset. A possible reason is that IU-Xray contains both frontal and lateral views, so it is difficult for the visual extractor to capture the differences between samples, and the model is therefore likely to generate similar reports. Besides, IU-Xray is a small dataset, so the diversity of its reports is lower than that of MIMIC-CXR. Hope this helps you figure out the problem.
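As a side note for anyone debugging this: a quick way to check whether a checkpoint has collapsed to near-identical reports is to measure the fraction of unique generated strings. This is just a sketch with a hypothetical helper (not part of the repo):

```python
def unique_report_fraction(reports):
    """Fraction of generated reports that are unique strings."""
    if not reports:
        return 0.0
    return len(set(reports)) / len(reports)

# A collapsed model tends to emit near-identical reports:
collapsed = ["no acute cardiopulmonary abnormality ."] * 9 + ["heart size is normal ."]
print(unique_report_fraction(collapsed))  # 0.2
```

A value close to 0 on the test split is a strong hint that good BLEU scores come from repeating one "safe" report.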
Hi, thanks for your reply. I tried to reproduce R2GenCMN and found that its best model and best results appear around the 25th epoch, which is within an acceptable range in my opinion. As I said earlier, a best model obtained in the first few epochs is usually not of much reference value. In my recent reproduction experiments, I got fairly good results in the first epoch, but eventually found that only one kind of report was being generated. Does this mean the model was not well trained in the first few epochs, so that the diversity of the generated reports is poor even though the six evaluation metrics look good? Further, if this is the case, please forgive my bold doubt, but the validity of the proposed method may then be less convincing.
Hi, I suspect the epoch at which the best performance occurs is also influenced by hyper-parameters such as the learning rate and by the working environment, in addition to the method itself. As we mentioned earlier, our work follows the most notable work in this area, such as R2Gen and R2GenCMN: we use six widely-used evaluation metrics to gauge the performance of our model. In addition, from my perspective, the deeper problem is that NLP evaluation metrics may not reflect the true performance of the model, which is a common issue in text generation tasks. This is why we normally focus more on larger datasets such as MIMIC-CXR to mitigate it. Moreover, higher diversity does not always come with higher precision; SME (subject-matter expert) involvement is required to truly gauge this.
Maybe you are right; something in the IU-Xray dataset itself could be causing this. I can agree with most of your points. By the way, I also appreciate your patient reply and your excellent contribution. Thank you.
Never mind, and thank you for your interest in our work and for the concrete discussion. Please feel free to reach out again if you have any other questions.
Hello! I've encountered the same problem as you! I also achieved very good results in the 1st epoch, but the generated sentences are all repetitive. I would like to share my thoughts and discuss them with you:
I have tried R2Gen, R2GenCMN, and XProNet, and their results on IU-XRay were very unstable. (You mentioned that R2GenCMN had the highest value at the 25th epoch, but I also experienced cases where the highest value occurred in the first five epochs). I have also modified my own model and encountered situations where the 1st epoch had very high results.
Currently, everyone (i.e., the previous papers) reports the best validation result, and the evaluation metrics do not include any measure of diversity. I don't think there is a good solution to this at the moment; taking the average of results over all epochs, or just using the final epoch's result, doesn't seem appropriate either.
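Since the standard metrics ignore diversity, one option is to report a corpus-level diversity score alongside them. A minimal sketch of distinct-n (unique n-grams divided by total n-grams over all generated reports, a commonly used diversity measure; the function name is my own):

```python
def distinct_n(reports, n=2):
    """Distinct-n: #unique n-grams / #total n-grams across generated reports."""
    ngrams, total = set(), 0
    for report in reports:
        tokens = report.split()
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
            total += 1
    return len(ngrams) / total if total else 0.0

# Two identical reports share all their bigrams, so the score is low:
print(distinct_n(["a b c", "a b c"], n=2))  # 0.5
```

Reporting distinct-1/distinct-2 next to BLEU would make the epoch-1 collapse visible even when the n-gram-overlap metrics look good.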
In addition, I found that using an LSTM as the decoder yields better diversity than a Transformer, but I don't understand the specific reason behind it.
However, on MIMIC-CXR, the above situation was largely alleviated, and the results were relatively stable. At least in the experiments I conducted, I did not encounter cases where the first five epochs had very high results. Perhaps we can explore more on MIMIC-CXR.
I think that we need better and more reasonable metrics to evaluate the ability of radiology report generation models😂~
@Leepoet Hello Leepoet, I have repeated the experiment many times and find it difficult to reproduce the reported results on the IU-Xray dataset. Could you share the parameters in utils.py for IU-Xray, or the random seed?
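For anyone else trying to reproduce: pinning every RNG source before training helps rule out seed variance. A generic sketch (the seed value and helper name are placeholders, not the repo's actual configuration):

```python
import random

def set_seed(seed: int = 42):
    """Pin Python's RNG. In the actual training script you would also call
    numpy.random.seed(seed), torch.manual_seed(seed),
    torch.cuda.manual_seed_all(seed), and set
    torch.backends.cudnn.deterministic = True / benchmark = False
    (assuming the standard PyTorch setup)."""
    random.seed(seed)

# With the same seed, runs draw identical random sequences:
set_seed(42)
a = [random.random() for _ in range(3)]
set_seed(42)
b = [random.random() for _ in range(3)]
print(a == b)  # True
```

Even with all seeds pinned, some CUDA kernels are nondeterministic, so small run-to-run variance on IU-Xray can remain.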
Hi, I have successfully reproduced your work and got exactly the same results as described in your paper. But I noticed that when experimenting on the IU-Xray dataset, the best model and results appear in the 3rd epoch. Does this phenomenon indicate that the validity of the method proposed in the paper needs to be re-discussed? Can you explain whether this phenomenon is reasonable? Generally speaking, reports generated from checkpoints of the early epochs have poor diversity. I tried generating reports both with my best retrained model and with the best model you provided, and found that this is indeed the case. Below is an excerpt of my experiment log containing the best results, to demonstrate that I successfully reproduced them.
07/24/2023 16:11:39 - INFO - modules.trainer - [3/30] Start to evaluate in the validation set.
07/24/2023 16:12:32 - INFO - modules.trainer - [3/30] Start to evaluate in the test set.
07/24/2023 16:13:57 - INFO - modules.trainer - epoch : 3
07/24/2023 16:13:57 - INFO - modules.trainer - ce_loss : 2.3583324741023457
07/24/2023 16:13:57 - INFO - modules.trainer - img_con : 0.010452255175282905
07/24/2023 16:13:57 - INFO - modules.trainer - txt_con : 0.02370573818510355
07/24/2023 16:13:57 - INFO - modules.trainer - img_bce_loss : 0.6931472420692444
07/24/2023 16:13:57 - INFO - modules.trainer - txt_bce_loss : 0.6931472420692444
07/24/2023 16:13:57 - INFO - modules.trainer - val_BLEU_1 : 0.4875411346726625
07/24/2023 16:13:57 - INFO - modules.trainer - val_BLEU_2 : 0.32324968962851985
07/24/2023 16:13:57 - INFO - modules.trainer - val_BLEU_3 : 0.2303989906968061
07/24/2023 16:13:57 - INFO - modules.trainer - val_BLEU_4 : 0.16892974375553144
07/24/2023 16:13:57 - INFO - modules.trainer - val_METEOR : 0.19912841341017073
07/24/2023 16:13:57 - INFO - modules.trainer - val_ROUGE_L : 0.3893886781595059
07/24/2023 16:13:57 - INFO - modules.trainer - test_BLEU_1 : 0.5247745358089907
07/24/2023 16:13:57 - INFO - modules.trainer - test_BLEU_2 : 0.35656897214407807
07/24/2023 16:13:57 - INFO - modules.trainer - test_BLEU_3 : 0.2620523629665125
07/24/2023 16:13:57 - INFO - modules.trainer - test_BLEU_4 : 0.19875032988045743
07/24/2023 16:13:57 - INFO - modules.trainer - test_METEOR : 0.21969653608856185
07/24/2023 16:13:57 - INFO - modules.trainer - test_ROUGE_L : 0.4113942119889325
07/24/2023 16:14:09 - INFO - modules.trainer - Saving checkpoint: /data/XProNet/results_RETRAIN_withReportGen/iu_xray/current_checkpoint.pth ...
07/24/2023 16:14:30 - INFO - modules.trainer - Saving current best: model_best.pth ...