yihp opened this issue 2 months ago
Hi @yihp,
Oof, unfortunately, I think you can only use CheXbert for English. Unless you can translate to English before evaluation? But you can certainly change monitor to something else.
OK, which monitor do you recommend for my Chinese task?
Hi @anicolson,
I learned from your paper that CheXbert, RadGraph ER, and CXR-BERT were intended to capture the clinical semantic similarity between the generated and radiologist reports, but these models are for English tasks and I can't reuse them. BERTScore seems to be able to evaluate Chinese tasks, so I have the following question: in your paper, BERTScore is used as the semantic similarity reward, but its results are not very good, whereas CXR-BERT works very well. Since my training currently uses monitor: 'val_report_chexbert_f1_macro', do you have any suggestions for the choice of monitor? BERTScore, CIDEr, ROUGE-L, or BLEU-4?
Hi @yihp,
I am not quite sure, to be honest. Maybe you could use a Chinese BERT for BERTScore? You could modify it here: https://github.com/aehrc/cxrmate/blob/820607a5511b9cb4131b09713c32655e7d9cbb03/tools/metrics/bertscore.py#L84
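Something along these lines might work with the bert_score package, as a minimal sketch (the Chinese sentences are toy examples, and you would want to check that bert-base-chinese, which bert_score selects for lang="zh", is suitable for radiology text):

```python
from bert_score import BERTScorer

# lang="zh" makes bert_score fall back to its default Chinese encoder
# (bert-base-chinese); a domain-specific Chinese model could be passed
# via model_type instead.
scorer = BERTScorer(lang="zh")

candidates = ["双肺纹理清晰，未见明显实变。"]  # generated report (toy example)
references = ["双肺未见实变，心影大小正常。"]  # radiologist report (toy example)

precision, recall, f1 = scorer.score(candidates, references)
print(f"BERTScore F1: {f1.mean().item():.4f}")
```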
Here are those options you mentioned for monitor:
val_report_bertscore_f1
val_report_nlg_bleu_4
val_report_nlg_cider
val_report_nlg_rouge
I pushed bertscore to the repo as well.
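If you do switch, it should mostly just be a matter of pointing the checkpointing callback at the new metric; roughly like this (a Lightning-style sketch, not the repo's exact trainer setup):

```python
from lightning.pytorch.callbacks import ModelCheckpoint  # pytorch_lightning.callbacks in older versions

# Monitor a language-agnostic metric instead of val_report_chexbert_f1_macro.
checkpoint_callback = ModelCheckpoint(
    monitor="val_report_nlg_cider",  # or val_report_bertscore_f1, etc.
    mode="max",                      # these metrics are higher-is-better
    save_top_k=1,
    filename="{epoch}-{val_report_nlg_cider:.4f}",
)
```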
Hi @anicolson,
Thank you very much for your reply.
You use val_report_chexbert_f1_macro as the monitor. I would like to ask you about the specific process. Do you use the trained cxrmate model to generate radiology reports, then let the CheXbert model predict the labels (14 categories), and then calculate the chexbert_f1 value against the actual labels?
Is this the process?
Hi @yihp,
So during validation/testing, the model will generate a report. Then, the generated report and the radiologist report are passed through chexbert (giving the chexbert labels for each). Classification scores are then calculated between the chexbert labels of the generated and radiologist reports.
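Roughly, the final comparison step amounts to multi-label classification scores over the CheXbert labels, e.g. (a simplified sketch, not the repo's actual metric code, and assuming the labels have already been binarised per observation):

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows = studies, columns = CheXbert observations (only 4 of the 14 shown),
# 1 = positive, 0 = otherwise (the actual handling of "uncertain"/"negative"
# mentions may differ).
labels_radiologist = np.array([[1, 0, 1, 0],
                               [0, 1, 0, 0],
                               [1, 1, 0, 1]])
labels_generated = np.array([[1, 0, 0, 0],
                             [0, 1, 0, 0],
                             [1, 0, 0, 1]])

macro_f1 = f1_score(labels_radiologist, labels_generated, average="macro", zero_division=0)
micro_f1 = f1_score(labels_radiologist, labels_generated, average="micro", zero_division=0)
print(f"macro F1: {macro_f1:.3f}, micro F1: {micro_f1:.3f}")
```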
Hi @anicolson,
OK, I got it. I changed the tokenizer and retrained the model, and the results are as follows:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ test_report_chexbert_accuracy_atelectasis │ 0.6504310369491577 │
│ test_report_chexbert_accuracy_cardiomegaly │ 1.0 │
│ test_report_chexbert_accuracy_consolidation │ 1.0 │
│ test_report_chexbert_accuracy_edema │ 1.0 │
│ test_report_chexbert_accuracy_enlarged_cardiomediastinum │ 0.9993842244148254 │
│ test_report_chexbert_accuracy_example │ 0.9673740863800049 │
│ test_report_chexbert_accuracy_fracture │ 1.0 │
│ test_report_chexbert_accuracy_lung_lesion │ 1.0 │
│ test_report_chexbert_accuracy_lung_opacity │ 0.9978448152542114 │
│ test_report_chexbert_accuracy_macro │ 0.9673740863800049 │
│ test_report_chexbert_accuracy_micro │ 0.9673740863800049 │
│ test_report_chexbert_accuracy_no_finding │ 1.0 │
│ test_report_chexbert_accuracy_pleural_effusion │ 0.9910714030265808 │
│ test_report_chexbert_accuracy_pleural_other │ 1.0 │
│ test_report_chexbert_accuracy_pneumonia │ 1.0 │
│ test_report_chexbert_accuracy_pneumothorax │ 1.0 │
│ test_report_chexbert_accuracy_support_devices │ 0.9045053124427795 │
│ test_report_chexbert_f1_atelectasis │ 0.7660866379737854 │
│ test_report_chexbert_f1_cardiomegaly │ 0.0 │
│ test_report_chexbert_f1_consolidation │ 0.0 │
│ test_report_chexbert_f1_edema │ 0.0 │
│ test_report_chexbert_f1_enlarged_cardiomediastinum │ 0.0 │
│ test_report_chexbert_f1_example │ 0.5966299176216125 │
│ test_report_chexbert_f1_fracture │ 0.0 │
│ test_report_chexbert_f1_lung_lesion │ 0.0 │
│ test_report_chexbert_f1_lung_opacity │ 0.0 │
│ test_report_chexbert_f1_macro │ 0.08107323199510574 │
│ test_report_chexbert_f1_micro │ 0.7244199514389038 │
│ test_report_chexbert_f1_no_finding │ 0.0 │
│ test_report_chexbert_f1_pleural_effusion │ 0.0 │
│ test_report_chexbert_f1_pleural_other │ 0.0 │
│ test_report_chexbert_f1_pneumonia │ 0.0 │
│ test_report_chexbert_f1_pneumothorax │ 0.0 │
│ test_report_chexbert_f1_support_devices │ 0.3689386248588562 │
│ test_report_chexbert_num_dicom_ids │ 2872.0 │
│ test_report_chexbert_num_study_ids │ 1624.0 │
│ test_report_chexbert_precision_atelectasis │ 0.8176434636116028 │
│ test_report_chexbert_precision_cardiomegaly │ 0.0 │
│ test_report_chexbert_precision_consolidation │ 0.0 │
│ test_report_chexbert_precision_edema │ 0.0 │
│ test_report_chexbert_precision_enlarged_cardiomediastinum │ 0.0 │
│ test_report_chexbert_precision_example │ 0.6533148884773254 │
│ test_report_chexbert_precision_fracture │ 0.0 │
│ test_report_chexbert_precision_lung_lesion │ 0.0 │
│ test_report_chexbert_precision_lung_opacity │ 0.0 │
│ test_report_chexbert_precision_macro │ 0.0843597799539566 │
│ test_report_chexbert_precision_micro │ 0.7660516500473022 │
│ test_report_chexbert_precision_no_finding │ 0.0 │
│ test_report_chexbert_precision_pleural_effusion │ 0.0 │
│ test_report_chexbert_precision_pleural_other │ 0.0 │
│ test_report_chexbert_precision_pneumonia │ 0.0 │
│ test_report_chexbert_precision_pneumothorax │ 0.0 │
│ test_report_chexbert_precision_support_devices │ 0.3633934557437897 │
│ test_report_chexbert_recall_atelectasis │ 0.7206460237503052 │
│ test_report_chexbert_recall_cardiomegaly │ 0.0 │
│ test_report_chexbert_recall_consolidation │ 0.0 │
│ test_report_chexbert_recall_edema │ 0.0 │
│ test_report_chexbert_recall_enlarged_cardiomediastinum │ 0.0 │
│ test_report_chexbert_recall_example │ 0.5752052664756775 │
│ test_report_chexbert_recall_fracture │ 0.0 │
│ test_report_chexbert_recall_lung_lesion │ 0.0 │
│ test_report_chexbert_recall_lung_opacity │ 0.0 │
│ test_report_chexbert_recall_macro │ 0.07823583483695984 │
│ test_report_chexbert_recall_micro │ 0.6870800852775574 │
│ test_report_chexbert_recall_no_finding │ 0.0 │
│ test_report_chexbert_recall_pleural_effusion │ 0.0 │
│ test_report_chexbert_recall_pleural_other │ 0.0 │
│ test_report_chexbert_recall_pneumonia │ 0.0 │
│ test_report_chexbert_recall_pneumothorax │ 0.0 │
│ test_report_chexbert_recall_support_devices │ 0.3746556341648102 │
│ test_report_cxr-bert │ 0.7429220676422119 │
│ test_report_nlg_bleu_1 │ 0.3031856417655945 │
│ test_report_nlg_bleu_2 │ 0.03638414293527603 │
│ test_report_nlg_bleu_3 │ 0.016369516029953957 │
│ test_report_nlg_bleu_4 │ 0.0022414636332541704 │
│ test_report_nlg_cider │ 0.04183460399508476 │
│ test_report_nlg_meteor │ 0.1805824488401413 │
│ test_report_nlg_num_dicom_ids │ 2872.0 │
│ test_report_nlg_num_study_ids │ 1624.0 │
│                   test_report_nlg_rouge                   │                    0.34699246287345886                    │
└───────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
My question is why test_report_cxr-bert is so high. Is it because CXR-BERT generalises well to Chinese? I plan to test it.
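For the test, I intend to roughly follow the microsoft/BiomedVLP-CXR-BERT-specialized model card and compare sentence embeddings like this (the get_projected_text_embeddings call is taken from that card; the example sentences are just placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# CXR-BERT-specialized ships custom embedding code on the Hub, hence trust_remote_code.
name = "microsoft/BiomedVLP-CXR-BERT-specialized"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

reports = ["No focal consolidation or pleural effusion.", "双肺未见实变，无胸腔积液。"]
tokens = tokenizer(reports, padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model.get_projected_text_embeddings(
        input_ids=tokens.input_ids, attention_mask=tokens.attention_mask
    )
# Pairwise similarity between the projected (L2-normalised) embeddings.
similarity = torch.mm(embeddings, embeddings.t())
print(similarity)
```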
And because I use val_report_chexbert_f1_macro as the monitor and my task is Chinese, the chexbert_f1 result is not a reliable reference. I will replace the monitor or fine-tune a chinese_chexbert as you mentioned.
How do the reports look? E.g., in experiments/.../trial_0/metric_outputs/reports/...
And I was suggesting a Chinese pre-trained Transformer encoder for BERTScore, not CheXbert or CXR-BERT (because I am not sure that they exist for the latter two).
Another question: I don't see any code for calculating BERTScore. There is no BERTScore in the test results, only test_report_cxr-bert.
Please pull the repo; it has been updated.
How do the reports look? E.g., in experiments/.../trial_0/metric_outputs/reports/...
And I was suggesting a Chinese pre-trained Transformer encoder for BERTScore, not CheXbert or CXR-BERT (because I am not sure that they exist for the latter two).
Hi @anicolson,
Thank you very much for your reply.
The generated reports seem to be fine, but many reports generated for different dicom_ids are identical, which indicates that the model's report generation ability is still relatively poor.
I also just tested the performance of CXR-BERT on Chinese, and it was very poor, which confirms that CXR-BERT is only suited to English chest X-ray tasks. I am not sure whether there is a similar Chinese BERT model that can calculate similarity; I will run some tests.
In addition, because CheXbert is only applicable to English tasks, it is not realistic for me to retrain a CheXbert for Chinese. So do you have any suggestions for the choice of monitor for my Chinese task? Is a Chinese pre-trained Transformer encoder for BERTScore a good choice? Or another metric?
Looking forward to your reply!
Hi @yihp,
I guess your best starting point would be a non-model-based metric, such as a word-overlap metric that is language agnostic (I assume these fit into this category, but you will have to double-check: val_report_nlg_bleu_4, val_report_nlg_cider, val_report_nlg_rouge).
You could use this until you find a Chinese-based model that could perhaps be used as a metric.
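One caveat for Chinese is that these overlap metrics operate on tokens, so the reports need word segmentation or character-level tokenisation first; for example, something like this with the sacrebleu package (just a sketch, which may differ from how the repo computes BLEU):

```python
import sacrebleu

# Toy generated vs. radiologist reports.
hypotheses = ["双肺纹理清晰，未见明显实变。"]
references = [["双肺未见实变，心影大小正常。"]]

# tokenize="zh" applies sacrebleu's built-in Chinese tokenisation before scoring.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(bleu.score)
```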
Hi @anicolson,
OK, I am doing experimental verification.
I have a question about eval_loss_step. On the TensorBoard training monitoring page, I only see train_loss_step, but no eval_loss_step. How should I add it?
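Would something like the following be the right approach, assuming the model is a LightningModule? (The compute_loss helper below is just a placeholder, not code from the repo.)

```python
import lightning.pytorch as pl  # or: import pytorch_lightning as pl (older versions)


class ReportGenerator(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # placeholder for the actual loss computation
        # on_step=True exposes a per-step "val_loss_step" scalar in TensorBoard,
        # alongside the epoch-level aggregate.
        self.log("val_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        return loss
```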
Hi! Thanks for your contribution. It is an excellent piece of work!
My task language is Chinese. I have trained a Chinese tokenizer and trained the model from scratch, but I have the following questions: Can I still use the CheXbert metrics? I am still using monitor: val_report_chexbert_f1_macro for my training. Should I change to another monitor?
Thank you very much for your time and consideration. I eagerly look forward to your response.