MiuLab / PLM-ICD

PLM-ICD: Automatic ICD Coding with Pretrained Language Models

Accuracy keeps decreasing throughout training regardless of the validation data and method #12

Closed LWserenic closed 4 weeks ago

LWserenic commented 3 months ago

Hello @chaoweihuang, I love your research. For context, I tried to apply your framework to ICD-10 coding on MIMIC-IV data. So far I haven't been successful: the accuracy keeps decreasing, and while the loss also decreases, all predicted probabilities stay below 0.5. I already tried the model you used in your paper, with no improvement. I even used the training data as the validation set to see whether the model would overfit, and it did not; the validation metrics still keep decreasing toward 0. For information, I use the latest transformers library with the latest accelerate (although at the time of this writing I am trying the Trainer class from Hugging Face). I use the preprocessing from the CAML repo, adapted for MIMIC-IV, and so far the dataset has the same format as the MIMIC-III example you showed in one of the issues. The tokenizer seems fine since the text can be decoded, even with the default vocab from the pretrained model. Hopefully you can give a bit of insight into this predicament of mine. Thank you.
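For reference, here is a rough sketch of why thresholded multi-label metrics collapse in this situation (illustrative only, not the repo's exact evaluation code; the arrays are made-up examples). If every sigmoid probability stays below 0.5, the thresholded predictions are all zeros, so precision/recall/F1 go to 0 even while the loss keeps decreasing:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical sigmoid outputs and gold labels for a multi-label coding task.
probs = np.array([[0.31, 0.12, 0.44],
                  [0.28, 0.09, 0.47]])
labels = np.array([[1, 0, 1],
                   [0, 0, 1]])

# Fixed 0.5 threshold: every probability is below it, so no code is predicted.
preds = (probs >= 0.5).astype(int)
print(preds.sum())  # 0 -> no positive predictions at all
print(f1_score(labels, preds, average="micro", zero_division=0))         # 0.0
print(precision_score(labels, preds, average="micro", zero_division=0))  # 0.0
print(recall_score(labels, preds, average="micro", zero_division=0))     # 0.0
```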

chaoweihuang commented 2 months ago

Hi @LWserenic ,

Thank you for your interest in our work! I suspect that this is a data issue. For starters, have you created a different ALL_CODES.txt for your dataset? AFAIK, ICD-9 and ICD-10 use different code sets, so they're not compatible. You'll need to create a new ALL_CODES.txt by collecting all the codes that appear in your dataset.
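A minimal sketch of how that could be done, assuming CAML-style preprocessed CSVs where the LABELS column holds semicolon-separated codes (the file and column names here are just examples, adjust them to your setup):

```python
import csv

# Collect every ICD-10 code that appears in the preprocessed splits and
# write them one per line, which is the format ALL_CODES.txt expects.
codes = set()
for split in ["train_full.csv", "dev_full.csv", "test_full.csv"]:  # hypothetical file names
    with open(split, newline="") as f:
        for row in csv.DictReader(f):
            labels = row["LABELS"].strip()
            if labels:
                codes.update(labels.split(";"))

with open("ALL_CODES.txt", "w") as f:
    for code in sorted(codes):
        f.write(code + "\n")
```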

Best, Chao-Wei

LWserenic commented 2 months ago

Hello @chaoweihuang, thanks for the reply. I think I was finally able to solve it by using the data preprocessing from Joakim Edin, applying decision boundary tuning, and changing some hyperparameters; the metrics are getting better now.
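For anyone landing here later, a minimal sketch of the kind of decision boundary tuning mentioned above: sweep a single global threshold on the validation set and keep the one that maximizes micro-F1. This is an assumption about the approach, not the exact code used; names and ranges are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(val_probs, val_labels, candidates=np.arange(0.05, 0.95, 0.01)):
    """Pick the global decision threshold that maximizes micro-F1 on validation data."""
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        preds = (val_probs >= t).astype(int)
        f1 = f1_score(val_labels, preds, average="micro", zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# val_probs: (num_examples, num_codes) sigmoid outputs; val_labels: same shape, values in {0, 1}
# threshold, f1 = tune_threshold(val_probs, val_labels)
```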