Please cite the following paper:
@inproceedings{huang-etal-2022-plm,
title = "{PLM}-{ICD}: Automatic {ICD} Coding with Pretrained Language Models",
author = "Huang, Chao-Wei and Tsai, Shang-Chi and Chen, Yun-Nung",
booktitle = "Proceedings of the 4th Clinical Natural Language Processing Workshop",
month = jul,
year = "2022",
address = "Seattle, WA",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.clinicalnlp-1.2",
pages = "10--20",
}
Install the required packages via `pip3 install -r requirements.txt`.
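If you want to verify the environment before training, a quick check like the one below works. The exact package list is defined in `requirements.txt`; PyTorch and Hugging Face `transformers` are assumed here.

```python
# Minimal environment check; assumes the requirements include PyTorch and
# Hugging Face transformers, which the training script relies on.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```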
Unfortunately, we are not allowed to redistribute the MIMIC dataset. Please follow the instructions from caml-mimic to preprocess the MIMIC-2 and MIMIC-3 datasets and place the resulting files under `data/mimic2` and `data/mimic3`, respectively.
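After preprocessing, a quick sanity check of the CSVs can catch path or formatting problems early. The sketch below assumes the caml-mimic output format, i.e. a `TEXT` column with the note text and a `LABELS` column of `;`-separated ICD codes; adjust the column names if your files differ.

```python
# Hypothetical sanity check for the preprocessed MIMIC-3 files; assumes
# caml-mimic's TEXT / LABELS columns with ';'-separated ICD codes.
import pandas as pd

df = pd.read_csv("data/mimic3/train_full.csv")
print(df.columns.tolist())

codes = set()
for labels in df["LABELS"].dropna():
    codes.update(str(labels).split(";"))
print(f"{len(df)} documents, {len(codes)} unique ICD codes")
```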
Please download the pretrained LMs you want to use. For PubMedBERT, you can simply pass `--model_name_or_path microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract` when training the model, and the script will download the checkpoint automatically.

You can also download our trained models to skip the training part. We provide 3 trained models.
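To confirm that a checkpoint name or local path resolves before launching a long training run, you can load it with the generic Hugging Face `transformers` API. This is only an illustrative check, not part of the repository's scripts.

```python
# Generic check that a pretrained LM loads; not part of run_icd.py.
from transformers import AutoModel, AutoTokenizer

name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # or a local path under models/
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
print("hidden size:", model.config.hidden_size)
```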
To train the model on the MIMIC-3 full setting:

```
cd src
python3 run_icd.py \
    --train_file ../data/mimic3/train_full.csv \
    --validation_file ../data/mimic3/dev_full.csv \
    --max_length 3072 \
    --chunk_size 128 \
    --model_name_or_path ../models/RoBERTa-base-PM-M3-Voc-distill-align-hf \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --per_device_eval_batch_size 1 \
    --num_train_epochs 20 \
    --num_warmup_steps 2000 \
    --output_dir ../models/roberta-mimic3-full \
    --model_type roberta \
    --model_mode laat
```
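The `--max_length` and `--chunk_size` settings reflect how PLM-ICD handles long documents: the input is truncated to 3072 tokens and split into 128-token chunks that the PLM encodes separately, after which the hidden states are put back together for label attention. A toy illustration of the splitting arithmetic (the real logic lives in the repository's code):

```python
# Toy illustration of --max_length / --chunk_size; the actual chunking is
# implemented inside the repository, this only shows the arithmetic.
max_length, chunk_size = 3072, 128
token_ids = list(range(2500))                 # stand-in for a tokenized document
token_ids = token_ids[:max_length]
chunks = [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]
print(len(chunks), "chunks of up to", chunk_size, "tokens")
```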
Notes:
- Use `--model_type [bert|longformer]` if you want to train other types of PLMs.
- Use `--code_50 --code_file ../data/mimic3/ALL_CODES_50.txt` to train on the MIMIC-3 top-50 setting.
- Use `--code_file ../data/mimic2/ALL_CODES.txt` to train on the MIMIC-2 setting.
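`--model_mode laat` selects the label-aware attention head used in the paper: each ICD code attends over the token representations and receives its own pooled vector and logit. A minimal LAAT-style sketch with hypothetical names and sizes, not the repository's exact module:

```python
# Minimal label-aware attention sketch (LAAT-style); illustrative only,
# not the exact module used in run_icd.py.
import torch
import torch.nn as nn

class LabelAttention(nn.Module):
    def __init__(self, hidden_size: int, proj_size: int, num_labels: int):
        super().__init__()
        self.W = nn.Linear(hidden_size, proj_size)      # token projection
        self.U = nn.Linear(proj_size, num_labels)       # one attention query per label
        self.out = nn.Linear(hidden_size, num_labels)   # per-label scoring weights

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq_len, hidden_size), PLM outputs over all chunks concatenated
        Z = torch.tanh(self.W(H))                        # (batch, seq_len, proj_size)
        A = torch.softmax(self.U(Z), dim=1)              # attention over tokens, per label
        V = A.transpose(1, 2) @ H                        # (batch, num_labels, hidden_size)
        logits = (V * self.out.weight).sum(-1) + self.out.bias  # (batch, num_labels)
        return logits

# Example: 24 chunks of 128 tokens -> 3072 positions; 50 labels as in the top-50 setting.
logits = LabelAttention(768, 512, 50)(torch.randn(1, 3072, 768))
print(logits.shape)  # torch.Size([1, 50])
```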
To evaluate a trained model on the test set (training is skipped by setting `--num_train_epochs 0`):

```
cd src
python3 run_icd.py \
    --train_file ../data/mimic3/train_full.csv \
    --validation_file ../data/mimic3/test_full.csv \
    --max_length 3072 \
    --chunk_size 128 \
    --model_name_or_path ../models/roberta-mimic3-full \
    --per_device_eval_batch_size 1 \
    --num_train_epochs 0 \
    --output_dir ../models/roberta-mimic3-full \
    --model_type roberta \
    --model_mode laat
```
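Because ICD coding is multi-label, results are usually reported as micro/macro F1 and AUC after thresholding the per-code sigmoid scores (commonly at 0.5). The script reports its own metrics; the sketch below only illustrates the standard computation with scikit-learn on toy data.

```python
# Generic multi-label metric sketch; run_icd.py reports its own metrics,
# this only illustrates the usual micro/macro F1 and micro-AUC computation.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(100, 50))   # gold code indicators (docs x codes)
y_score = rng.random(size=(100, 50))          # per-code sigmoid probabilities
y_pred = (y_score >= 0.5).astype(int)         # standard 0.5 threshold

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
print("micro-AUC:", roc_auc_score(y_true, y_score, average="micro"))
```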