yihp opened this issue 2 months ago
Hi @yihp,
Oof, unfortunately, I think you can only use CheXbert for English. Unless you can translate to English before evaluation? But you can certainly change monitor to something else.
OK, which monitor do you recommend for my Chinese task?
Hi @anicolson,
I learned from your paper that CheXbert, RadGraph ER, and CXR-BERT were intended to capture the clinical semantic similarity between the generated and radiologist reports, but these models are for English tasks and I can't reuse them. BERTScore seems to be able to evaluate Chinese tasks, so I have the following question: in your paper, BERTScore is used as the semantic similarity reward, but its results are not very good, whereas CXR-BERT works very well. Since my training currently uses monitor: 'val_report_chexbert_f1_macro', do you have any suggestions for the choice of monitor? BERTScore, CIDEr, ROUGE-L, or BLEU-4?
Hi @yihp,
I am not quite sure, to be honest. Maybe you could use a Chinese BERT for BERTScore? You could modify it here: https://github.com/aehrc/cxrmate/blob/820607a5511b9cb4131b09713c32655e7d9cbb03/tools/metrics/bertscore.py#L84
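Something along these lines might work with the bert_score package, as a minimal sketch (the Chinese sentences are toy examples, and you would want to check that bert-base-chinese, which bert_score selects for lang="zh", is suitable for radiology text):

```python
from bert_score import BERTScorer

# lang="zh" makes bert_score fall back to its default Chinese encoder
# (bert-base-chinese); a domain-specific Chinese model could be passed
# via model_type instead.
scorer = BERTScorer(lang="zh")

candidates = ["双肺纹理清晰，未见明显实变。"]  # generated report (toy example)
references = ["双肺未见实变，心影大小正常。"]  # radiologist report (toy example)

precision, recall, f1 = scorer.score(candidates, references)
print(f"BERTScore F1: {f1.mean().item():.4f}")
```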
Here are those options you mentioned for monitor:
val_report_bertscore_f1
val_report_nlg_bleu_4
val_report_nlg_cider
val_report_nlg_rouge
I pushed bertscore to the repo as well.
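If you do switch, it should mostly just be a matter of pointing the checkpointing callback at the new metric; roughly like this (a Lightning-style sketch, not the repo's exact trainer setup):

```python
from lightning.pytorch.callbacks import ModelCheckpoint  # pytorch_lightning.callbacks in older versions

# Monitor a language-agnostic metric instead of val_report_chexbert_f1_macro.
checkpoint_callback = ModelCheckpoint(
    monitor="val_report_nlg_cider",  # or val_report_bertscore_f1, etc.
    mode="max",                      # these metrics are higher-is-better
    save_top_k=1,
    filename="{epoch}-{val_report_nlg_cider:.4f}",
)
```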
Hi @anicolson,
Thank you very much for your reply.
You use val_report_chexbert_f1_macro as the monitor. I would like to ask you about the specific process. Do you use the trained cxrmate model to generate radiology reports, then let the CheXbert model predict the labels (14 categories), and then calculate the chexbert_f1 value against the actual labels?
Is this the process?
Hi @yihp,
So during validation/testing, the model will generate a report. Then, the generated report and the radiologist report are passed through chexbert (giving the chexbert labels for each). Classification scores are then calculated between the chexbert labels of the generated and radiologist reports.
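Roughly, the final comparison step amounts to multi-label classification scores over the CheXbert labels, e.g. (a simplified sketch, not the repo's actual metric code, and assuming the labels have already been binarised per observation):

```python
import numpy as np
from sklearn.metrics import f1_score

# Rows = studies, columns = CheXbert observations (only 4 of the 14 shown),
# 1 = positive, 0 = otherwise (the actual handling of "uncertain"/"negative"
# mentions may differ).
labels_radiologist = np.array([[1, 0, 1, 0],
                               [0, 1, 0, 0],
                               [1, 1, 0, 1]])
labels_generated = np.array([[1, 0, 0, 0],
                             [0, 1, 0, 0],
                             [1, 0, 0, 1]])

macro_f1 = f1_score(labels_radiologist, labels_generated, average="macro", zero_division=0)
micro_f1 = f1_score(labels_radiologist, labels_generated, average="micro", zero_division=0)
print(f"macro F1: {macro_f1:.3f}, micro F1: {micro_f1:.3f}")
```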
Hi @anicolson,
OK, I got it. I changed the tokenizer and retrained the model, and the results are as follows:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Test metric ┃ DataLoader 0 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ test_report_chexbert_accuracy_atelectasis │ 0.6504310369491577 │
│ test_report_chexbert_accuracy_cardiomegaly │ 1.0 │
│ test_report_chexbert_accuracy_consolidation │ 1.0 │
│ test_report_chexbert_accuracy_edema │ 1.0 │
│ test_report_chexbert_accuracy_enlarged_cardiomediastinum │ 0.9993842244148254 │
│ test_report_chexbert_accuracy_example │ 0.9673740863800049 │
│ test_report_chexbert_accuracy_fracture │ 1.0 │
│ test_report_chexbert_accuracy_lung_lesion │ 1.0 │
│ test_report_chexbert_accuracy_lung_opacity │ 0.9978448152542114 │
│ test_report_chexbert_accuracy_macro │ 0.9673740863800049 │
│ test_report_chexbert_accuracy_micro │ 0.9673740863800049 │
│ test_report_chexbert_accuracy_no_finding │ 1.0 │
│ test_report_chexbert_accuracy_pleural_effusion │ 0.9910714030265808 │
│ test_report_chexbert_accuracy_pleural_other │ 1.0 │
│ test_report_chexbert_accuracy_pneumonia │ 1.0 │
│ test_report_chexbert_accuracy_pneumothorax │ 1.0 │
│ test_report_chexbert_accuracy_support_devices │ 0.9045053124427795 │
│ test_report_chexbert_f1_atelectasis │ 0.7660866379737854 │
│ test_report_chexbert_f1_cardiomegaly │ 0.0 │
│ test_report_chexbert_f1_consolidation │ 0.0 │
│ test_report_chexbert_f1_edema │ 0.0 │
│ test_report_chexbert_f1_enlarged_cardiomediastinum │ 0.0 │
│ test_report_chexbert_f1_example │ 0.5966299176216125 │
│ test_report_chexbert_f1_fracture │ 0.0 │
│ test_report_chexbert_f1_lung_lesion │ 0.0 │
│ test_report_chexbert_f1_lung_opacity │ 0.0 │
│ test_report_chexbert_f1_macro │ 0.08107323199510574 │
│ test_report_chexbert_f1_micro │ 0.7244199514389038 │
│ test_report_chexbert_f1_no_finding │ 0.0 │
│ test_report_chexbert_f1_pleural_effusion │ 0.0 │
│ test_report_chexbert_f1_pleural_other │ 0.0 │
│ test_report_chexbert_f1_pneumonia │ 0.0 │
│ test_report_chexbert_f1_pneumothorax │ 0.0 │
│ test_report_chexbert_f1_support_devices │ 0.3689386248588562 │
│ test_report_chexbert_num_dicom_ids │ 2872.0 │
│ test_report_chexbert_num_study_ids │ 1624.0 │
│ test_report_chexbert_precision_atelectasis │ 0.8176434636116028 │
│ test_report_chexbert_precision_cardiomegaly │ 0.0 │
│ test_report_chexbert_precision_consolidation │ 0.0 │
│ test_report_chexbert_precision_edema │ 0.0 │
│ test_report_chexbert_precision_enlarged_cardiomediastinum │ 0.0 │
│ test_report_chexbert_precision_example │ 0.6533148884773254 │
│ test_report_chexbert_precision_fracture │ 0.0 │
│ test_report_chexbert_precision_lung_lesion │ 0.0 │
│ test_report_chexbert_precision_lung_opacity │ 0.0 │
│ test_report_chexbert_precision_macro │ 0.0843597799539566 │
│ test_report_chexbert_precision_micro │ 0.7660516500473022 │
│ test_report_chexbert_precision_no_finding │ 0.0 │
│ test_report_chexbert_precision_pleural_effusion │ 0.0 │
│ test_report_chexbert_precision_pleural_other │ 0.0 │
│ test_report_chexbert_precision_pneumonia │ 0.0 │
│ test_report_chexbert_precision_pneumothorax │ 0.0 │
│ test_report_chexbert_precision_support_devices │ 0.3633934557437897 │
│ test_report_chexbert_recall_atelectasis │ 0.7206460237503052 │
│ test_report_chexbert_recall_cardiomegaly │ 0.0 │
│ test_report_chexbert_recall_consolidation │ 0.0 │
│ test_report_chexbert_recall_edema │ 0.0 │
│ test_report_chexbert_recall_enlarged_cardiomediastinum │ 0.0 │
│ test_report_chexbert_recall_example │ 0.5752052664756775 │
│ test_report_chexbert_recall_fracture │ 0.0 │
│ test_report_chexbert_recall_lung_lesion │ 0.0 │
│ test_report_chexbert_recall_lung_opacity │ 0.0 │
│ test_report_chexbert_recall_macro │ 0.07823583483695984 │
│ test_report_chexbert_recall_micro │ 0.6870800852775574 │
│ test_report_chexbert_recall_no_finding │ 0.0 │
│ test_report_chexbert_recall_pleural_effusion │ 0.0 │
│ test_report_chexbert_recall_pleural_other │ 0.0 │
│ test_report_chexbert_recall_pneumonia │ 0.0 │
│ test_report_chexbert_recall_pneumothorax │ 0.0 │
│ test_report_chexbert_recall_support_devices │ 0.3746556341648102 │
│ test_report_cxr-bert │ 0.7429220676422119 │
│ test_report_nlg_bleu_1 │ 0.3031856417655945 │
│ test_report_nlg_bleu_2 │ 0.03638414293527603 │
│ test_report_nlg_bleu_3 │ 0.016369516029953957 │
│ test_report_nlg_bleu_4 │ 0.0022414636332541704 │
│ test_report_nlg_cider │ 0.04183460399508476 │
│ test_report_nlg_meteor │ 0.1805824488401413 │
│ test_report_nlg_num_dicom_ids │ 2872.0 │
│ test_report_nlg_num_study_ids │ 1624.0 │
│                   test_report_nlg_rouge                   │                    0.34699246287345886                    │
└───────────────────────────────────────────────────────────┴───────────────────────────────────────────────────────────┘
My question is why test_report_cxr-bert is so high. Is it because CXR-BERT generalises well to Chinese? I plan to test it.
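For the test, I intend to roughly follow the microsoft/BiomedVLP-CXR-BERT-specialized model card and compare sentence embeddings like this (the get_projected_text_embeddings call is taken from that card; the example sentences are just placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# CXR-BERT-specialized ships custom embedding code on the Hub, hence trust_remote_code.
name = "microsoft/BiomedVLP-CXR-BERT-specialized"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True)

reports = ["No focal consolidation or pleural effusion.", "双肺未见实变，无胸腔积液。"]
tokens = tokenizer(reports, padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model.get_projected_text_embeddings(
        input_ids=tokens.input_ids, attention_mask=tokens.attention_mask
    )
# Pairwise similarity between the projected (L2-normalised) embeddings.
similarity = torch.mm(embeddings, embeddings.t())
print(similarity)
```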
And because I use val_report_chexbert_f1_macro as the monitor and my task is Chinese, the chexbert_f1 result is not a reliable reference. I will replace the monitor or fine-tune a chinese_chexbert as you mentioned.
How do the reports look? E.g., in experiments/.../trial_0/metric_outputs/reports/...
And I was suggesting a Chinese pre-trained Transformer encoder for BERTScore, not CheXbert or CXR-BERT (because I am not sure that they exist for the latter two).
Another question: I don't see any code for calculating BERTScore. There is no BERTScore in the test results, only test_report_cxr-bert.
Please pull the repo; it has been updated.
How do the reports look? E.g., in experiments/.../trial_0/metric_outputs/reports/...
And I was suggesting a Chinese pre-trained Transformer encoder for BERTScore, not CheXbert or CXR-BERT (because I am not sure that they exist for the latter two).
Hi @anicolson,
Thank you very much for your reply.
The generated reports seem to be fine, but many reports generated for different dicom_ids are identical, which indicates that the model's report generation ability is still relatively poor.
I also just tested the performance of CXR-BERT on Chinese, and it was very poor, which confirms that CXR-BERT is only suited to English chest X-ray tasks. I am not sure whether there is a similar Chinese BERT model that can calculate similarity; I will run some tests.
In addition, because CheXbert is only applicable to English tasks, it is not realistic for me to retrain a CheXbert for Chinese. So do you have any suggestions for the choice of monitor for my Chinese task? Is a Chinese pre-trained Transformer encoder for BERTScore a good choice? Or another metric?
Looking forward to your reply!
Hi @yihp,
I guess your best starting point would be a non-model-based metric, such as a word-overlap metric that is language agnostic (I assume these fit into this category, but you will have to double-check: val_report_nlg_bleu_4, val_report_nlg_cider, val_report_nlg_rouge).
You could use this until you find a Chinese-based model that could perhaps be used as a metric.
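One caveat for Chinese is that these overlap metrics operate on tokens, so the reports need word segmentation or character-level tokenisation first; for example, something like this with the sacrebleu package (just a sketch, which may differ from how the repo computes BLEU):

```python
import sacrebleu

# Toy generated vs. radiologist reports.
hypotheses = ["双肺纹理清晰，未见明显实变。"]
references = [["双肺未见实变，心影大小正常。"]]

# tokenize="zh" applies sacrebleu's built-in Chinese tokenisation before scoring.
bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
print(bleu.score)
```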
Hi @anicolson,
OK, I am doing experimental verification.
I have a question about eval_loss_step. On the TensorBoard training monitoring page, I only see train_loss_step, but no eval_loss_step. How should I add it?
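Would something like the following be the right approach, assuming the model is a LightningModule? (The compute_loss helper below is just a placeholder, not code from the repo.)

```python
import lightning.pytorch as pl  # or: import pytorch_lightning as pl (older versions)


class ReportGenerator(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # placeholder for the actual loss computation
        # on_step=True exposes a per-step "val_loss_step" scalar in TensorBoard,
        # alongside the epoch-level aggregate.
        self.log("val_loss", loss, on_step=True, on_epoch=True, prog_bar=True)
        return loss
```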
Hi! Thanks for your contribution. It is an excellent piece of work!
My task language is Chinese. I have trained a Chinese tokenizer and trained the model from scratch, but I have the following questions: Can I still use the CheXbert metrics? I am still using monitor: val_report_chexbert_f1_macro for my training. Should I change to another monitor?
Thank you very much for your time and consideration. I eagerly look forward to your response.