AI (Artificial Intelligence) plays an indispensable role in the biomedical field, helping to improve medical technology. To further accelerate AI research in the biomedical field, we present the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark, which includes datasets collected from real-world biomedical scenarios, baseline models, and an online platform for model evaluation, comparison, and analysis.
We evaluate 11 current Chinese pre-trained models on the eight biomedical language understanding tasks and report their baseline results.
Model | CMeEE | CMeIE | CDN | CTC | STS | QIC | QTR | QQR | Avg. |
---|---|---|---|---|---|---|---|---|---|
BERT-base | 62.1 | 54.0 | 55.4 | 69.2 | 83.0 | 84.3 | 60.0 | 84.7 | 69.0 |
BERT-wwm-ext-base | 61.7 | 54.0 | 55.4 | 70.1 | 83.9 | 84.5 | 60.9 | 84.4 | 69.4 |
ALBERT-tiny | 50.5 | 35.9 | 50.2 | 61.0 | 79.7 | 75.8 | 55.5 | 79.8 | 61.1 |
ALBERT-xxlarge | 61.8 | 47.6 | 37.5 | 66.9 | 84.8 | 84.8 | 62.2 | 83.1 | 66.1 |
RoBERTa-large | 62.1 | 54.4 | 56.5 | 70.9 | 84.7 | 84.2 | 60.9 | 82.9 | 69.6 |
RoBERTa-wwm-ext-base | 62.4 | 53.7 | 56.4 | 69.4 | 83.7 | 85.5 | 60.3 | 82.7 | 69.3 |
RoBERTa-wwm-ext-large | 61.8 | 55.9 | 55.7 | 69.0 | 85.2 | 85.3 | 62.8 | 84.4 | 70.0 |
PCL-MedBERT | 60.6 | 49.1 | 55.8 | 67.8 | 83.8 | 84.3 | 59.3 | 82.5 | 67.9 |
ZEN | 61.0 | 50.1 | 57.8 | 68.6 | 83.5 | 83.2 | 60.3 | 83.0 | 68.4 |
MacBERT-base | 60.7 | 53.2 | 57.7 | 67.7 | 84.4 | 84.9 | 59.7 | 84.0 | 69.0 |
MacBERT-large | 62.4 | 51.6 | 59.3 | 68.6 | 85.6 | 82.7 | 62.9 | 83.5 | 69.6 |
Human | 67.0 | 66.0 | 65.0 | 78.0 | 93.0 | 88.0 | 71.0 | 89.0 | 77.1 |
We present baseline models for the biomedical tasks and release the corresponding code for a quick start.
python3 / pytorch 1.7 / transformers 4.5.1 / jieba / gensim / sklearn
The whole zip package includes the datasets of 8 biomedical NLU tasks (more detail in the following section). Every task includes the following files:
├── {Task}
| └── {Task}_train.json
| └── {Task}_test.json
| └── {Task}_dev.json
| └── example_gold.json
| └── example_pred.json
| └── README.md
Notice: a few tasks have additional files, e.g. the CHIP-CTC task includes a 'category.xlsx' file.
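Before training, it can help to confirm that a downloaded task folder matches the layout above. The following is a minimal sketch, assuming every task uses the `.json` file names shown in the tree; `missing_task_files` is a hypothetical helper, not part of the released code, and tasks that ship other file types would need their own list.

```python
from pathlib import Path

# File names listed in the layout above; "{task}" is filled in per task.
EXPECTED_FILES = [
    "{task}_train.json",
    "{task}_dev.json",
    "{task}_test.json",
    "example_gold.json",
    "example_pred.json",
    "README.md",
]


def missing_task_files(root, task):
    """Return the expected file names that are absent from root/task."""
    task_dir = Path(root) / task
    expected = [name.format(task=task) for name in EXPECTED_FILES]
    return [name for name in expected if not (task_dir / name).is_file()]
```

For example, `missing_task_files("CBLUEDatasets", "KUAKE-QQR")` returns an empty list when the folder is complete.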
You can download the Chinese pre-trained models you need (download URLs are provided above). With Huggingface-Transformers, these models can be easily accessed and loaded.
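As a rough illustration of loading a downloaded checkpoint with Huggingface-Transformers, the sketch below joins a model directory and model name and passes the result to `AutoTokenizer.from_pretrained` and `AutoModel.from_pretrained`. It assumes the checkpoint folder contains the configuration, vocabulary, and weight files that Transformers expects; `local_model_path` and `load_encoder` are hypothetical helper names, and models that are not standard Transformers architectures (ZEN, for instance) may need the repository's own model classes instead.

```python
from pathlib import Path


def local_model_path(model_dir, model_name):
    """Join the model directory and model name, e.g. data/model_data/bert-base."""
    return Path(model_dir) / model_name


def load_encoder(model_dir, model_name):
    """Load a tokenizer and encoder from a local checkpoint directory."""
    # Imported lazily so the path helper works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    path = str(local_model_path(model_dir, model_name))
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModel.from_pretrained(path)
    return tokenizer, model
```

A call such as `load_encoder("data/model_data", "bert-base")` would then return the tokenizer and the encoder, provided the folder exists locally.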
The reference directory:
├── CBLUE
| └── baselines
| └── run_classifier.py
| └── ...
| └── examples
| └── run_qqr.sh
| └── ...
| └── cblue
| └── CBLUEDatasets
| └── KUAKE-QQR
| └── ...
| └── data
| └── output
| └── model_data
| └── bert-base
| └── ...
| └── result_output
| └── KUAKE-QQR_test.json
| └── ...
The shell files for training and evaluation of every task are provided in `examples/` and can be run directly.
You can also use the training code in `baselines/` and write your own shell files as needed:
* `baselines/run_classifier.py`: supports the `{sts, qqr, qtr, qic, ctc, ee}` tasks;
* `baselines/run_cdn.py`: supports the `{cdn}` task;
* `baselines/run_ie.py`: supports the `{ie}` task.
Run the shell files with `bash examples/run_{task}.sh`. The contents of the shell files are as follows:
DATA_DIR="CBLUEDatasets"
TASK_NAME="qqr"
MODEL_TYPE="bert"
MODEL_DIR="data/model_data"
MODEL_NAME="chinese-bert-wwm"
OUTPUT_DIR="data/output"
RESULT_OUTPUT_DIR="data/result_output"
MAX_LENGTH=128
python baselines/run_classifier.py \
--data_dir=${DATA_DIR} \
--model_type=${MODEL_TYPE} \
--model_dir=${MODEL_DIR} \
--model_name=${MODEL_NAME} \
--task_name=${TASK_NAME} \
--output_dir=${OUTPUT_DIR} \
--result_output_dir=${RESULT_OUTPUT_DIR} \
--do_train \
--max_length=${MAX_LENGTH} \
--train_batch_size=16 \
--eval_batch_size=16 \
--learning_rate=3e-5 \
--epochs=3 \
--warmup_proportion=0.1 \
--earlystop_patience=3 \
--logging_steps=250 \
--save_steps=250 \
--seed=2021
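The task-to-script mapping listed above can also be expressed in Python when you prefer to launch runs from a script rather than from shell files. This is a minimal sketch: `build_train_command` is a hypothetical helper, it only reuses flags that appear in the shell file above, and whether `run_cdn.py` and `run_ie.py` accept exactly the same flags is an assumption.

```python
import shlex

# Which baseline script handles which task, as listed above.
SCRIPT_FOR_TASK = {
    **{task: "baselines/run_classifier.py"
       for task in ("sts", "qqr", "qtr", "qic", "ctc", "ee")},
    "cdn": "baselines/run_cdn.py",
    "ie": "baselines/run_ie.py",
}


def build_train_command(task, model_type, model_name, max_length=128):
    """Return the training command as an argument list (not executed here)."""
    script = SCRIPT_FOR_TASK[task]  # raises KeyError for an unknown task
    return [
        "python", script,
        "--data_dir=CBLUEDatasets",
        f"--model_type={model_type}",
        "--model_dir=data/model_data",
        f"--model_name={model_name}",
        f"--task_name={task}",
        "--output_dir=data/output",
        "--result_output_dir=data/result_output",
        "--do_train",
        f"--max_length={max_length}",
    ]


if __name__ == "__main__":
    # Print the command so it can be inspected before running it.
    print(shlex.join(build_train_command("qqr", "bert", "bert-base")))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.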
Notice: the best checkpoint is saved in `OUTPUT_DIR/MODEL_NAME/`.
* `MODEL_TYPE`: supports the `{bert, roberta, albert, zen}` model types;
* `MODEL_NAME`: supports the `{bert-base, bert-wwm-ext, albert-tiny, albert-xxlarge, zen, pcl-medbert, roberta-large, roberta-wwm-ext-base, roberta-wwm-ext-large, macbert-base, macbert-large}` Chinese pre-trained models.
The `MODEL_TYPE`-`MODEL_NAME` mappings are listed below.
MODEL_TYPE | MODEL_NAME |
---|---|
bert | bert-base, bert-wwm-ext, pcl-medbert, macbert-base, macbert-large |
roberta | roberta-large, roberta-wwm-ext-base, roberta-wwm-ext-large |
albert | albert-tiny, albert-xxlarge |
zen | zen |
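The mapping above is easy to get wrong when editing a shell file by hand, so a small lookup can catch mismatches early. The sketch below simply transcribes the table; `model_type_for` is a hypothetical helper, and a checkpoint directory named differently from the table entries (the example shell files use `chinese-bert-wwm`) would have to be added to the dictionary first.

```python
# MODEL_TYPE -> supported MODEL_NAME values, transcribed from the table above.
MODEL_NAMES_BY_TYPE = {
    "bert": ["bert-base", "bert-wwm-ext", "pcl-medbert",
             "macbert-base", "macbert-large"],
    "roberta": ["roberta-large", "roberta-wwm-ext-base", "roberta-wwm-ext-large"],
    "albert": ["albert-tiny", "albert-xxlarge"],
    "zen": ["zen"],
}

# Inverted view: MODEL_NAME -> MODEL_TYPE.
MODEL_TYPE_FOR_NAME = {
    name: model_type
    for model_type, names in MODEL_NAMES_BY_TYPE.items()
    for name in names
}


def model_type_for(model_name):
    """Return the MODEL_TYPE for a MODEL_NAME, or raise ValueError if unknown."""
    try:
        return MODEL_TYPE_FOR_NAME[model_name]
    except KeyError:
        raise ValueError(f"unknown MODEL_NAME: {model_name!r}") from None
```

For instance, `model_type_for("macbert-large")` yields `"bert"`, which matches the first row of the table.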
Run the shell files with `bash examples/run_{task}.sh predict`. The contents of the shell files are as follows:
DATA_DIR="CBLUEDatasets"
TASK_NAME="qqr"
MODEL_TYPE="bert"
MODEL_DIR="data/model_data"
MODEL_NAME="chinese-bert-wwm"
OUTPUT_DIR="data/output"
RESULT_OUTPUT_DIR="data/result_output"
MAX_LENGTH=128
python baselines/run_classifier.py \
--data_dir=${DATA_DIR} \
--model_type=${MODEL_TYPE} \
--model_name=${MODEL_NAME} \
--model_dir=${MODEL_DIR} \
--task_name=${TASK_NAME} \
--output_dir=${OUTPUT_DIR} \
--result_output_dir=${RESULT_OUTPUT_DIR} \
--do_predict \
--max_length=${MAX_LENGTH} \
--eval_batch_size=16 \
--seed=2021
Notice: the prediction result `{TASK_NAME}_test.json` will be generated in `RESULT_OUTPUT_DIR`.
Before submitting your predicted test files, you can check their format with `format_checker` to avoid an invalid evaluation score caused by format errors.
Copy the original test file (without answers) to the `format_checker` directory and rename it to {taskname}_test_raw.[json|jsonl|tsv], then run the checker on it together with your prediction file.
# take the CMeEE task for example:
cp ${path_to_CMeEE}/CMeEE_test.json ${current_dir}/CMeEE_test_raw.json
python3 format_checker_${taskname}.py {taskname}_test_raw.[json|jsonl|tsv] {taskname}_test.[json|jsonl|tsv]
python3 format_checker_CMeEE.py CMeEE_test_raw.json CMeEE_test.json
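To give a feel for the kind of problem such a check catches, here is a minimal sketch of one sanity check: the prediction file should hold one record per record in the raw test file. This is not the logic of the released `format_checker` scripts; `check_same_length` is a hypothetical helper, and it assumes both files are JSON arrays.

```python
import json


def check_same_length(raw_path, pred_path):
    """Confirm the prediction file has one record per raw test record."""
    with open(raw_path, encoding="utf-8") as f:
        raw = json.load(f)
    with open(pred_path, encoding="utf-8") as f:
        pred = json.load(f)
    if not isinstance(pred, list):
        raise ValueError("prediction file must contain a JSON list")
    if len(pred) != len(raw):
        raise ValueError(
            f"expected {len(raw)} records, found {len(pred)} in the prediction file"
        )
    return len(pred)
```

A call such as `check_same_length("CMeEE_test_raw.json", "CMeEE_test.json")` returns the record count when the two files line up and raises `ValueError` otherwise.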
#### What is special?
##### IMCS-NER & IMCS-V2-NER tasks:
* Step1: Copy both the original test file (without answers), IMCS-NER_test.json (IMCS-V2-NER_test.json), and IMCS_test.json (IMCS-V2_test.json) to this directory, and rename the former to IMCS-NER_test_raw.json (IMCS-V2-NER_test_raw.json). Then run the format checker on these files together with your prediction file.
```shell
# Copy the files for the IMCS-NER task:
cp ${path_to_IMCS-NER}/IMCS-NER_test.json ${current_dir}/IMCS-NER_test_raw.json
cp ${path_to_IMCS-NER}/IMCS_test.json ${current_dir}
# Copy the files for the IMCS-V2-NER task:
cp ${path_to_IMCS-V2-NER}/IMCS-V2-NER_test.json ${current_dir}/IMCS-V2-NER_test_raw.json
cp ${path_to_IMCS-V2-NER}/IMCS-V2_test.json ${current_dir}
# Run the checker for the IMCS-NER task:
python3 format_checker_IMCS_V1_NER.py IMCS-NER_test_raw.json IMCS-NER_test.json IMCS_test.json
# Run the checker for the IMCS-V2-NER task:
python3 format_checker_IMCS_V2_NER.py IMCS-V2-NER_test_raw.json IMCS-V2-NER_test.json IMCS-V2_test.json
```
If you want to enable the optional check logic in the check_format function, which is commented out in the master branch, you also need to copy the normalized dictionary files to the current directory.
Compress `RESULT_OUTPUT_DIR` as a `.zip` file and submit it to get your evaluation scores on these biomedical NLU tasks, and your ranking!
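The compression step can be scripted with the standard library. The following is a minimal sketch using `shutil.make_archive`; `zip_results` and the default archive name `submission` are hypothetical, and whether the platform expects the prediction files at the archive root or inside a folder is an assumption you should verify against the submission instructions.

```python
import shutil


def zip_results(result_dir, archive_base="submission"):
    """Create <archive_base>.zip from the files in result_dir and return its path.

    Files are stored relative to result_dir, so they sit at the archive root.
    Keep archive_base outside result_dir so the archive does not include itself.
    """
    return shutil.make_archive(archive_base, "zip", root_dir=result_dir)
```

For example, `zip_results("data/result_output")` writes `submission.zip` to the current working directory.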
To promote the development and application of language models in the biomedical field, we collect data from real-world biomedical scenarios and release eight biomedical NLU (natural language understanding) tasks, covering information extraction from medical text (named entity recognition, relation extraction), medical term normalization, medical text classification, medical sentence similarity estimation, and medical QA.
Dataset | Task | Train | Dev | Test | Evaluation Metrics |
---|---|---|---|---|---|
CMeEE | NER | 15,000 | 5,000 | 3,000 | Micro F1 |
CMeIE | Relation Extraction | 14,339 | 3,585 | 4,482 | Micro F1 |
CHIP-CDN | Diagnosis Normalization | 6,000 | 2,000 | 10,192 | Micro F1 |
CHIP-STS | Sentence Similarity | 16,000 | 4,000 | 10,000 | Macro F1 |
CHIP-CTC | Sentence Classification | 22,962 | 7,682 | 10,000 | Macro F1 |
KUAKE-QIC | Sentence Classification | 6,931 | 1,955 | 1,944 | Accuracy |
KUAKE-QTR | NLI | 24,174 | 2,913 | 5,465 | Accuracy |
KUAKE-QQR | NLI | 15,000 | 1,600 | 1,596 | Accuracy |
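The table mixes Micro F1, Macro F1, and Accuracy, so it may help to see how the two F1 variants differ on the same predictions. The sketch below computes both for single-label classification; it is an illustration only, not the official scoring code, and for the extraction tasks (CMeEE, CMeIE) the counts would be taken over extracted items rather than per-example labels.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length label sequences."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0, 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p is a false positive
            fn[g] += 1  # gold label g was missed
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro
```

With one label per example, Micro F1 coincides with accuracy, while Macro F1 gives rare labels the same weight as frequent ones.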
This task is named entity recognition on medical text. Given schema data and medical sentences, models are expected to extract entities related to clinical information and assign each entity the correct category.
This task is entity relation extraction on medical text. Given a schema and medical sentences, models are expected to automatically extract triples [(S1, P1, O1), (S2, P2, O2), …] that satisfy the schema constraints. The schema defines the predicate categories and the corresponding subject and object types, e.g.
(“subject_type”:“疾病”,“predicate”: “药物治疗”,“object_type”:“药物”) (“subject_type”:“疾病”,“predicate”: “实验室检查”,“object_type”:“检查”)
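The schema constraint can be checked mechanically: a triple is admissible only if its subject type, predicate, and object type appear together in the schema. The sketch below uses just the two schema entries quoted above; the real schema file contains more entries, and `satisfies_schema` is a hypothetical helper rather than part of the released code.

```python
# The two schema entries quoted above, written as plain dictionaries.
SCHEMA = [
    {"subject_type": "疾病", "predicate": "药物治疗", "object_type": "药物"},
    {"subject_type": "疾病", "predicate": "实验室检查", "object_type": "检查"},
]

# Set of allowed (subject_type, predicate, object_type) combinations.
ALLOWED = {
    (entry["subject_type"], entry["predicate"], entry["object_type"])
    for entry in SCHEMA
}


def satisfies_schema(subject_type, predicate, object_type):
    """Return True if the typed triple is permitted by the schema."""
    return (subject_type, predicate, object_type) in ALLOWED
```

A filter of this kind could be applied to predicted triples before writing the submission file, so that triples with impossible type combinations are dropped.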
The evaluation task is the normalization of the diagnosis entity from the Chinese medical record. Given a diagnosis entity, models are expected to return corresponding standard terms.
In this evaluation task, given 44 semantic categories of screening criteria (more detail in `category.xlsx`) and descriptions of Chinese clinical screening criteria, models are expected to return the category of each description.
In this evaluation task, given pairs of sentences involving five different diseases, models are expected to judge the semantic similarity of the pair of sentences.
In this evaluation task, given a medical query, models are expected to classify the intention of patients. These medical queries have 11 categories: `diagnosis`, `cause`, `method`, `advice`, `metric explain`, `disease expression`, `result`, `attention`, `effect`, `price`, `other`.
In this evaluation task, given a pair of query and title, models are expected to predict whether the topic of the pair query and title is consistent and the extent of their consistency.
In this evaluation task, given a pair of queries, models are expected to predict the extent of similarity between them.
The `Data Processor` and `Model Trainer` modules can be found in `cblue/`. You can easily construct your code and train and evaluate your own models and methods. The corresponding `Data Processor`, `Dataset`, and `Trainer` for the eight tasks are listed below:
Task | Data Processor (cblue.data) | Dataset (cblue.data) | Trainer (cblue.trainer) |
---|---|---|---|
CMeEE | EEDataProcessor | EEDataset | EETrainer |
CMeIE | ERDataProcessor/REDataProcessor | ERDataset/REDataset | ERTrainer/RETrainer |
CHIP-CDN | CDNDataProcessor | CDNDataset | CDNForCLSTrainer/CDNForNUMTrainer |
CHIP-CTC | CTCDataProcessor | CTCDataset | CTCTrainer |
CHIP-STS | STSDataProcessor | STSDataset | STSTrainer |
KUAKE-QIC | QICDataProcessor | QICDataset | QICTrainer |
KUAKE-QQR | QQRDataProcessor | QQRDataset | QQRTrainer |
KUAKE-QTR | QTRDataProcessor | QTRDataset | QTRTrainer |
Example for CMeEE
from cblue.data import EEDataProcessor, EEDataset
from cblue.trainer import EETrainer
from cblue.metrics import ee_metric, ee_commit_prediction
# get the train/dev/test samples
data_processor = EEDataProcessor(root=...)
train_samples = data_processor.get_train_sample()
eval_samples = data_processor.get_dev_sample()
test_samples = data_processor.get_test_sample()
# wrap the training samples in a torch Dataset
train_dataset = EEDataset(train_samples, tokenizer=..., mode='train', max_length=...)
# train the model
trainer = EETrainer(...)
trainer.train(...)
# prediction and generation of the result file
test_dataset = EEDataset(test_samples, tokenizer=..., mode='test', max_length=...)
trainer.predict(test_dataset)
We list the hyper-parameters of every task used in the baseline experiments.
Common hyper-parameters
Param | Value |
---|---|
warmup_proportion | 0.1 |
weight_decay | 0.01 |
adam_epsilon | 1e-8 |
max_grad_norm | 1.0 |
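To make `warmup_proportion` concrete, the sketch below computes the learning rate at a given step under a linear warmup followed by a linear decay to zero, the shape produced by `get_linear_schedule_with_warmup` in Huggingface-Transformers. That the baselines use exactly this schedule, and that no gradient accumulation is involved, are assumptions; `total_training_steps` and `linear_warmup_lr` are hypothetical helper names.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Number of optimizer steps, assuming one step per batch."""
    return math.ceil(num_examples / batch_size) * epochs


def linear_warmup_lr(step, total_steps, base_lr, warmup_proportion=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at the final step.
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining
```

For KUAKE-QQR with 15,000 training examples, a batch size of 16, and 3 epochs, `total_training_steps(15000, 16, 3)` gives 2814 steps, so a proportion of 0.1 corresponds to 281 warmup steps under these assumptions.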
CMeEE
Hyper-parameters for the training of pre-trained models with a token classification head on top for named entity recognition of the CMeEE task.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 5 | 32 | 128 | 4e-5 |
bert-wwm-ext | 5 | 32 | 128 | 4e-5 |
roberta-wwm-ext | 5 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 5 | 12 | 65 | 2e-5 |
roberta-large | 5 | 12 | 65 | 2e-5 |
albert-tiny | 10 | 32 | 128 | 5e-5 |
albert-xxlarge | 5 | 12 | 65 | 1e-5 |
zen | 5 | 20 | 128 | 4e-5 |
macbert-base | 5 | 32 | 128 | 4e-5 |
macbert-large | 5 | 12 | 80 | 2e-5 |
PCL-MedBERT | 5 | 32 | 128 | 4e-5 |
CMeIE-ER
Hyper-parameters for the training of pre-trained models with a token-level classifier for subject and object recognition of the CMeIE task.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 7 | 32 | 128 | 5e-5 |
bert-wwm-ext | 7 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 7 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 7 | 16 | 80 | 4e-5 |
roberta-large | 7 | 16 | 80 | 2e-5 |
albert-tiny | 10 | 32 | 128 | 4e-5 |
albert-xxlarge | 7 | 16 | 80 | 1e-5 |
zen | 7 | 20 | 128 | 4e-5 |
macbert-base | 7 | 32 | 128 | 4e-5 |
macbert-large | 7 | 20 | 80 | 2e-5 |
PCL-MedBERT | 7 | 32 | 128 | 4e-5 |
CMeIE-RE
Hyper-parameters for the training of pre-trained models with a classifier for the entity pairs relation prediction of the CMeIE task.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 8 | 32 | 128 | 5e-5 |
bert-wwm-ext | 8 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 8 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 8 | 16 | 80 | 4e-5 |
roberta-large | 8 | 16 | 80 | 2e-5 |
albert-tiny | 10 | 32 | 128 | 4e-5 |
albert-xxlarge | 8 | 16 | 80 | 1e-5 |
zen | 8 | 20 | 128 | 4e-5 |
macbert-base | 8 | 32 | 128 | 4e-5 |
macbert-large | 8 | 20 | 80 | 2e-5 |
PCL-MedBERT | 8 | 32 | 128 | 4e-5 |
CHIP-CTC
Hyper-parameters for the training of pre-trained models with a sequence classification head on top for screening criteria classification of the CHIP-CTC task.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 5 | 32 | 128 | 5e-5 |
bert-wwm-ext | 5 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 5 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 5 | 32 | 50 | 3e-5 |
roberta-large | 5 | 24 | 50 | 2e-5 |
albert-tiny | 10 | 32 | 128 | 4e-5 |
albert-xxlarge | 5 | 20 | 50 | 1e-5 |
zen | 5 | 20 | 128 | 4e-5 |
macbert-base | 5 | 32 | 128 | 4e-5 |
macbert-large | 5 | 20 | 50 | 2e-5 |
PCL-MedBERT | 5 | 32 | 128 | 4e-5 |
CHIP-CDN-cls
Hyper-parameters for the CHIP-CDN task. We model the CHIP-CDN task in two stages: a recall stage and a ranking stage. `num_negative_sample` sets the number of negative samples drawn to train the ranking model during the ranking stage. `recall_k` sets the number of candidates recalled in the recall stage.
Param | Value |
---|---|
recall_k | 200 |
num_negative_sample | 5+5(random) |
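The two stages described above can be sketched in a few lines. In this illustration the recall stage ranks standard terms by character-level Jaccard overlap, which is a stand-in chosen for brevity and not the retrieval method used by the baseline. Reading `5+5(random)` as five negatives taken from the recalled candidates plus five drawn at random is also an assumption, and both function names are hypothetical.

```python
import random


def char_jaccard(a, b):
    """Character-level Jaccard similarity between two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def recall_candidates(term, standard_terms, recall_k=200):
    """Recall stage: keep the recall_k standard terms most similar to the term."""
    ranked = sorted(standard_terms, key=lambda s: char_jaccard(term, s), reverse=True)
    return ranked[:recall_k]


def sample_negatives(candidates, all_terms, positives,
                     n_recalled=5, n_random=5, seed=2021):
    """Pick negatives for the ranking model: some recalled, some random."""
    rng = random.Random(seed)
    positive_set = set(positives)
    # Hard negatives: top recalled candidates that are not correct answers.
    hard = [c for c in candidates if c not in positive_set][:n_recalled]
    # Random negatives: drawn from the remaining standard terms.
    pool = [t for t in all_terms if t not in positive_set and t not in hard]
    return hard + rng.sample(pool, min(n_random, len(pool)))
```

The ranking model would then be trained on pairs of the original term with each positive and each sampled negative.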
Hyper-parameters for the training of pre-trained models with a sequence classifier for the ranking model of the CHIP-CDN task. We encode the pairs of the original term and standard phrase from candidates recalled during the recall stage and then pass the pooled output to the classifier, which predicts the relevance between the original term and standard phrase.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 32 | 128 | 4e-5 |
bert-wwm-ext | 3 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 3 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 3 | 32 | 40 | 4e-5 |
roberta-large | 3 | 32 | 40 | 4e-5 |
albert-tiny | 3 | 32 | 128 | 4e-5 |
albert-xxlarge | 3 | 32 | 40 | 1e-5 |
zen | 3 | 20 | 128 | 4e-5 |
macbert-base | 3 | 32 | 128 | 4e-5 |
macbert-large | 3 | 32 | 40 | 2e-5 |
PCL-MedBERT | 3 | 32 | 128 | 4e-5 |
CHIP-CDN-num
Hyper-parameters for the training of pre-trained models with a sequence classifier for the prediction of the number of standard phrases corresponding to the original term in the CHIP-CDN task. We take the prediction results of the model as the number we choose from the most relevant standard phrases, combining with the prediction of the ranking model.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 20 | 32 | 128 | 4e-5 |
bert-wwm-ext | 20 | 32 | 128 | 5e-5 |
roberta-wwm-ext | 20 | 32 | 128 | 4e-5 |
roberta-wwm-ext-large | 20 | 12 | 40 | 4e-5 |
roberta-large | 20 | 12 | 40 | 4e-5 |
albert-tiny | 20 | 32 | 128 | 4e-5 |
albert-xxlarge | 20 | 12 | 40 | 1e-5 |
zen | 20 | 20 | 128 | 4e-5 |
macbert-base | 20 | 32 | 128 | 4e-5 |
macbert-large | 20 | 12 | 40 | 2e-5 |
PCL-MedBERT | 20 | 32 | 128 | 4e-5 |
CHIP-STS
Hyper-parameters for the training of pre-trained models with a sequence classifier for sentence similarity prediction of the CHIP-STS task.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 16 | 40 | 3e-5 |
bert-wwm-ext | 3 | 16 | 40 | 3e-5 |
roberta-wwm-ext | 3 | 16 | 40 | 4e-5 |
roberta-wwm-ext-large | 3 | 16 | 40 | 4e-5 |
roberta-large | 3 | 16 | 40 | 2e-5 |
albert-tiny | 3 | 16 | 40 | 5e-5 |
albert-xxlarge | 3 | 16 | 40 | 1e-5 |
zen | 3 | 16 | 40 | 2e-5 |
macbert-base | 3 | 16 | 40 | 3e-5 |
macbert-large | 3 | 16 | 40 | 3e-5 |
PCL-MedBERT | 3 | 16 | 40 | 2e-5 |
KUAKE-QIC
Hyper-parameters for the training of pre-trained models with a sequence classifier for query intention prediction of the KUAKE-QIC task.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 16 | 50 | 2e-5 |
bert-wwm-ext | 3 | 16 | 50 | 2e-5 |
roberta-wwm-ext | 3 | 16 | 50 | 2e-5 |
roberta-wwm-ext-large | 3 | 16 | 50 | 2e-5 |
roberta-large | 3 | 16 | 50 | 3e-5 |
albert-tiny | 3 | 16 | 50 | 5e-5 |
albert-xxlarge | 3 | 16 | 50 | 1e-5 |
zen | 3 | 16 | 50 | 2e-5 |
macbert-base | 3 | 16 | 50 | 3e-5 |
macbert-large | 3 | 16 | 50 | 2e-5 |
PCL-MedBERT | 3 | 16 | 50 | 2e-5 |
KUAKE-QTR
Hyper-parameters for the training of pre-trained models with a sequence classifier for query-title pairs relevance prediction of the KUAKE-QTR task.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 16 | 40 | 4e-5 |
bert-wwm-ext | 3 | 16 | 40 | 2e-5 |
roberta-wwm-ext | 3 | 16 | 40 | 3e-5 |
roberta-wwm-ext-large | 3 | 16 | 40 | 2e-5 |
roberta-large | 3 | 16 | 40 | 2e-5 |
albert-tiny | 3 | 16 | 40 | 5e-5 |
albert-xxlarge | 3 | 16 | 40 | 1e-5 |
zen | 3 | 16 | 40 | 3e-5 |
macbert-base | 3 | 16 | 40 | 2e-5 |
macbert-large | 3 | 16 | 40 | 2e-5 |
PCL-MedBERT | 3 | 16 | 40 | 3e-5 |
KUAKE-QQR
Hyper-parameters for the training of pre-trained models with a sequence classifier for query-query pairs relevance prediction of the KUAKE-QQR task.
Model | epoch | batch_size | max_length | learning_rate |
---|---|---|---|---|
bert-base | 3 | 16 | 30 | 3e-5 |
bert-wwm-ext | 3 | 16 | 30 | 3e-5 |
roberta-wwm-ext | 3 | 16 | 30 | 3e-5 |
roberta-wwm-ext-large | 3 | 16 | 30 | 3e-5 |
roberta-large | 3 | 16 | 30 | 2e-5 |
albert-tiny | 3 | 16 | 30 | 5e-5 |
albert-xxlarge | 3 | 16 | 30 | 3e-5 |
zen | 3 | 16 | 30 | 2e-5 |
macbert-base | 3 | 16 | 30 | 2e-5 |
macbert-large | 3 | 16 | 30 | 2e-5 |
PCL-MedBERT | 3 | 16 | 30 | 2e-5 |
@inproceedings{zhang-etal-2022-cblue,
title = "{CBLUE}: A {C}hinese Biomedical Language Understanding Evaluation Benchmark",
author = "Zhang, Ningyu and
Chen, Mosha and
Bi, Zhen and
Liang, Xiaozhuan and
Li, Lei and
Shang, Xin and
Yin, Kangping and
Tan, Chuanqi and
Xu, Jian and
Huang, Fei and
Si, Luo and
Ni, Yuan and
Xie, Guotong and
Sui, Zhifang and
Chang, Baobao and
Zong, Hui and
Yuan, Zheng and
Li, Linfeng and
Yan, Jun and
Zan, Hongying and
Zhang, Kunli and
Tang, Buzhou and
Chen, Qingcai",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-long.544",
pages = "7888--7915",
abstract = "Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, is gradually offering great promise for medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification, and an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling.",
}