MiuLab / PLM-ICD

PLM-ICD: Automatic ICD Coding with Pretrained Language Models

Discrepancy in Model Performance Between MIMIC-50 and MIMIC-Full Datasets with Automatic Mixed Precision #9

Closed. AHMAD-DOMA closed this issue 1 month ago

AHMAD-DOMA commented 10 months ago

I have encountered a performance discrepancy between the MIMIC-50 and MIMIC-Full datasets while training with automatic mixed precision. I used the same configuration settings and training parameters for both datasets, aiming to reproduce the results from the paper. The MIMIC-50 results are reasonably close to the expected numbers, but the MIMIC-Full results show a large discrepancy.
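
For reference, the mixed-precision part of my training loop follows the standard PyTorch AMP pattern sketched below. This is only an illustration, with `model`, `optimizer`, and `train_dataloader` assumed to be already constructed; it is not the exact code from the repository's training script.

```python
import torch

scaler = torch.cuda.amp.GradScaler()   # loss scaler to avoid fp16 gradient underflow
gradient_accumulation_steps = 8        # matches --gradient_accumulation_steps below

for step, batch in enumerate(train_dataloader):
    with torch.cuda.amp.autocast():    # run the forward pass in mixed precision
        loss = model(**batch).loss / gradient_accumulation_steps
    scaler.scale(loss).backward()      # backward on the scaled loss
    if (step + 1) % gradient_accumulation_steps == 0:
        scaler.step(optimizer)         # unscale gradients, then optimizer.step()
        scaler.update()                # adjust the loss scale for subsequent steps
        optimizer.zero_grad()
```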

Details:

  1. MIMIC-50 Configuration:

    • --max_length: 3072
    • --chunk_size: 128
    • --model_name_or_path: RoBERTa-base-PM-M3-Voc/RoBERTa-base-PM-M3-Voc-hf
    • --per_device_train_batch_size: 1
    • --gradient_accumulation_steps: 8
    • --per_device_eval_batch_size: 1
    • --num_train_epochs: 20
    • --num_warmup_steps: 2000
    • --model_type: roberta
    • --model_mode: laat

    Results for MIMIC-50 with Automatic Mixed Precision:

    • For the best threshold (0.45):
      • f1_micro: 66.96
      • prec_micro: 67.02
      • rec_micro: 66.89

    Micro-F1 reported in the paper: 71.00

  2. MIMIC-Full Configuration:

    • --max_length: 3072
    • --chunk_size: 128
    • --model_name_or_path: RoBERTa-base-PM-M3-Voc/RoBERTa-base-PM-M3-Voc-hf
    • --per_device_train_batch_size: 1
    • --gradient_accumulation_steps: 8
    • --per_device_eval_batch_size: 1
    • --num_train_epochs: 20
    • --num_warmup_steps: 2000
    • --model_type: roberta
    • --model_mode: laat

    Results for MIMIC-Full with Automatic Mixed Precision:

    • For the best threshold (0.2):
      • f1_micro: 13.68
      • prec_micro: 35.35
      • rec_micro: 8.48

    Micro-F1 reported in the paper: 59.8
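
Putting the flags above into a single invocation, the command for either run looks roughly like the following. I'm assuming the repository's `run_icd.py` entry point here, and I've left out the dataset and output-directory arguments, which depend on the local preprocessing layout:

```bash
python run_icd.py \
    --model_name_or_path RoBERTa-base-PM-M3-Voc/RoBERTa-base-PM-M3-Voc-hf \
    --model_type roberta \
    --model_mode laat \
    --max_length 3072 \
    --chunk_size 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 20 \
    --num_warmup_steps 2000
# dataset and output paths omitted; they follow the README's preprocessing setup
```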

I kindly request assistance in diagnosing and resolving the performance issue encountered with the MIMIC-Full dataset. The goal is to align the results with the paper's reported metrics as closely as possible.

Thank you for your attention and support in addressing this matter.

FareedKhan-dev commented 10 months ago

Hello @AHMAD-DOMA, I have also been able to reproduce the results presented in the paper for the MIMIC-III full dataset, although I haven't yet done so for the top-50 codes. I used the same parameters as in the paper and found the optimal threshold to be 0.5. The resulting metrics are as follows:

Best Threshold: 0.5

Performance Metrics:

Macro Accuracy: 0.0589
Macro Precision: 0.0984
Macro Recall: 0.0727
Macro F1 Score: 0.0836   
Micro Accuracy: 0.4059
Micro Precision: 0.7148  <---
Micro Recall: 0.4844     <---
Micro F1 Score: 0.5775   <---
Precision at 8: 0.7644
Recall at 8: 0.4026
F1 Score at 8: 0.5274
Macro AUC: 0.9237
Micro AUC: 0.9892
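
For reference, a minimal sketch of this kind of threshold selection is below: sweep candidate thresholds over the model's sigmoid outputs and keep the one with the highest micro-F1. The `probs` and `labels` arrays are placeholders of shape (num_docs, num_codes), not objects from the repository's code.

```python
import numpy as np
from sklearn.metrics import f1_score

def pick_threshold(probs, labels):
    """Return the decision threshold (and its micro-F1) that maximizes micro-F1.

    probs  -- sigmoid outputs of the model, shape (num_docs, num_codes)
    labels -- binary ground-truth code matrix, same shape
    """
    best_t, best_f1 = 0.5, 0.0
    for t in np.arange(0.05, 0.96, 0.05):
        preds = (probs >= t).astype(int)
        f1 = f1_score(labels, preds, average="micro", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1
```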

AHMAD-DOMA commented 10 months ago

Thank you, @FareedKhan-dev. If I understand correctly, you followed the preprocessing steps described in the README and trained for 20 epochs. If so, could you please share the configuration of your experiment?

FareedKhan-dev commented 10 months ago

I apologize for any misunderstanding. To clarify, I didn't perform any training, only the preprocessing. My results were generated with the pre-trained model provided in the README.

chaoweihuang commented 6 months ago

Hi @AHMAD-DOMA,

Thank you for your interest in our work! I'd say that the discrepancy in the MIMIC-full configuration is so significant that I suspect there's something wrong with the training process. The following factors might be relevant:

I'd suggest using the pretrained checkpoints directly as it's the easiest way to replicate the results.

Best, Chao-Wei