PubmedBERT - BiLSTM - CRF

Code for our ICTA 2023 paper "An Architecture for More Fine-grained Hidden Representation in Named Entity Recognition for Biomedical Texts". Please cite our paper if you find this repository helpful in your research:

@InProceedings{10.1007/978-3-031-49529-8_13,
author="Tho, Bui Duc
and Giang, Son-Ba
and Nguyen, Minh-Tien
and Nguyen, Tri-Thanh",
editor="Nghia, Phung Trung
and Thai, Vu Duc
and Thuy, Nguyen Thanh
and Son, Le Hoang
and Huynh, Van-Nam",
title="An Architecture for More Fine-Grained Hidden Representation in Named Entity Recognition for Biomedical Texts",
booktitle="Advances in Information and Communication Technology",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="114--125",
abstract="This paper introduces a model for Biomedical Named Entity Recognition (BioNER). Different from existing models that mainly rely on pre-trained models, i.e., PubMedBERT, the proposed model is empowered by using PubMedBERT as the main backbone for mapping input sequences to contextual vectors. To learn more fine-grained hidden representation and effectively adapt to the recognition downstream task, the model stacks BiLSTM and CRFs on top of PubMedBERT. Given an input sentence, the model first maps the sentence into contextual vectors by PubMedBERT. The vectors are next fed into a BiLSTM layer for learning a more fine-grained hidden representation that serves as the input for sequence labeling by using CRFs. We confirm the efficiency of the model on benchmark corpora. Experimental results on 29 diverse datasets indicate that the proposed model obtains promising results compared to good as well as state-of-the-art baselines. The ablation study also shows the behavior of the model in several aspects.",
isbn="978-3-031-49529-8"
}

This project implements our PubmedBERT-BiLSTM-CRF model. The implementation is built upon fairseq and heavily inspired by CLNER; many thanks to the authors for making their code available.
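For orientation, here is a minimal sketch of the PubMedBERT-BiLSTM-CRF stack described in the paper's abstract, assuming the transformers and pytorch-crf packages. The class name, layer sizes, and checkpoint identifier are illustrative, not this repository's exact code or hyperparameters:

import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf

class PubMedBertBiLstmCrf(nn.Module):
    # Hypothetical class name; a sketch of the architecture, not the repo's code.
    def __init__(self, num_tags, lstm_hidden=256,
                 encoder_name="microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # BiLSTM refines PubMedBERT's contextual vectors (768-dim for the base model).
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        # 1. Map the input sentence to contextual vectors with PubMedBERT.
        ctx = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        # 2. Learn a more fine-grained hidden representation with the BiLSTM.
        hidden, _ = self.bilstm(ctx)
        scores = self.emissions(hidden)
        mask = attention_mask.bool()
        if tags is not None:
            # Training: negative log-likelihood of the gold tag sequence under the CRF.
            return -self.crf(scores, tags, mask=mask)
        # 3. Inference: Viterbi decoding over the emission scores.
        return self.crf.decode(scores, mask=mask)

Stacking the BiLSTM between the encoder and the CRF is what the paper refers to as learning a more fine-grained hidden representation before sequence labeling.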

Guide

Requirements

The project is based on PyTorch 1.1+ and Python 3.6+. To install the required dependencies, run:

pip install -r requirements.txt

Datasets

The datasets used in our paper are available here.
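Each corpus is expected in the usual two-column CoNLL format: one token and its NER tag per line, with a blank line between sentences (this matches the column_format in the dataset config below). An illustrative snippet with IOB tags (tag names vary per corpus; AnatEM, for example, uses the Anatomy type):

The O
intestinal B-Anatomy
mucosa I-Anatomy
appeared O
normal O
. O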

Training

Training NER Models

Run:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/pubmed_bilstm_crf.yaml

Dataset config

To set the dataset manually, specify it in the $config_file as follows:

targets: ner
ner:
  Corpus: ColumnCorpus-1
  ColumnCorpus-1: 
    data_folder: datasets/MTL-Bioinformatics-2016/AnatEM-IOB
    column_format:
      0: text
      1: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/your_ner_tags.pkl

The tag_dictionary entry is a path to the tag dictionary for the task. If the path does not exist, the code will generate a tag dictionary at that path automatically.

The dataset entry has the format Corpus: $CorpusClassName-$id, where $id is a name for the dataset (anything you like). You can also train on multiple datasets jointly, as in the sketch below.
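For example, here is a hypothetical config joining two corpora, assuming the colon-separated Corpus convention used by the CLNER-style configs this project builds on (the second data folder is a placeholder):

targets: ner
ner:
  Corpus: ColumnCorpus-1:ColumnCorpus-2
  ColumnCorpus-1:
    data_folder: datasets/MTL-Bioinformatics-2016/AnatEM-IOB
    column_format:
      0: text
      1: ner
    tag_to_bioes: ner
  ColumnCorpus-2:
    data_folder: datasets/MTL-Bioinformatics-2016/BC2GM-IOB
    column_format:
      0: text
      1: ner
    tag_to_bioes: ner
  tag_dictionary: resources/taggers/your_ner_tags.pkl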

Please refer to Config File for more details.

Config File

The config files are in YAML format.
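Any YAML file with the structure shown above can be passed to the training script through the --config flag from the Training section; for instance, with a hypothetical file name:

CUDA_VISIBLE_DEVICES=0 python train.py --config config/my_anatem_config.yaml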