This project provides pre-trained language models for ancient Chinese and applies them to NLP tasks including language modeling, word segmentation, and part-of-speech tagging.
Our paper has been accepted to ROCLING 2022! Please check out our paper.
We have uploaded our models to the Hugging Face Hub.
The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.
pip install transformers==4.15.0
pip install torch==1.10.2
Pre-trained Language Model
You can use ckiplab/bert-base-han-chinese directly with a pipeline for masked language modeling.
from transformers import pipeline
# Initialize
unmasker = pipeline('fill-mask', model='ckiplab/bert-base-han-chinese')
# Input text with [MASK]
unmasker("黎[MASK]於變時雍。")
# output
[{'sequence': '黎 民 於 變 時 雍 。',
'score': 0.14885780215263367,
'token': 3696,
'token_str': '民'},
{'sequence': '黎 庶 於 變 時 雍 。',
'score': 0.0859643816947937,
'token': 2433,
'token_str': '庶'},
{'sequence': '黎 氏 於 變 時 雍 。',
'score': 0.027848130092024803,
'token': 3694,
'token_str': '氏'},
{'sequence': '黎 人 於 變 時 雍 。',
'score': 0.023678112775087357,
'token': 782,
'token_str': '人'},
{'sequence': '黎 生 於 變 時 雍 。',
'score': 0.018718384206295013,
'token': 4495,
'token_str': '生'}]
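The pipeline returns candidates sorted by score, so if you only need the single most likely completion you can take the first entry (a minimal sketch based on the output format above):
# Take the top-scoring candidate
results = unmasker("黎[MASK]於變時雍。")
best = results[0]
best['token_str'], best['score']   # ('民', 0.1488...)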
You can use ckiplab/bert-base-han-chinese to get the features of a given text in PyTorch.
from transformers import AutoTokenizer, AutoModel
# Initialize the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("ckiplab/bert-base-han-chinese")
model = AutoModel.from_pretrained("ckiplab/bert-base-han-chinese")
# Input text
text = "黎民於變時雍。"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# get encoded token vectors
output.last_hidden_state # torch.Tensor with Size([1, 9, 768])
# get encoded sentence vector
output.pooler_output # torch.Tensor with Size([1, 768])
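If you prefer a sentence vector that does not depend on the pooler head, a common alternative is to mean-pool the token vectors over the attention mask (a minimal sketch reusing the tensors above; which representation works better depends on your task):
# Mean-pool token vectors into a single sentence vector
mask = encoded_input['attention_mask'].unsqueeze(-1)      # Size([1, 9, 1])
summed = (output.last_hidden_state * mask).sum(dim=1)     # Size([1, 768])
sentence_vector = summed / mask.sum(dim=1)                 # Size([1, 768])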
Word Segmentation (WS)
In WS, ckiplab/bert-base-han-chinese-ws divides written text into meaningful units, i.e. words. The task is formulated as labeling each character as either the beginning (B) of a word or inside (I) a word.
from transformers import pipeline
# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
# Input text
classifier("帝堯曰放勳")
# output
[{'entity': 'B',
'score': 0.9999793,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'I',
'score': 0.9915047,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'B',
'score': 0.99992275,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'B',
'score': 0.99905187,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'I',
'score': 0.96299917,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
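To recover the actual segmentation, the character-level B/I labels can be merged so that every B starts a new word (a minimal post-processing sketch based on the output format shown above; merge_ws is a hypothetical helper, not part of the library):
# Group characters into words: 'B' starts a new word, 'I' extends it
def merge_ws(tokens):
    words = []
    for token in tokens:
        if token['entity'] == 'B' or not words:
            words.append(token['word'])
        else:
            words[-1] += token['word']
    return words

merge_ws(classifier("帝堯曰放勳"))
# ['帝堯', '曰', '放勳']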
Part-of-Speech (PoS) Tagging
In PoS tagging, ckiplab/bert-base-han-chinese-pos recognizes parts of speech in a given text. The task is formulated as labeling each character (token) with a part-of-speech tag.
from transformers import pipeline
# Initialize
classifier = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")
# Input text
classifier("帝堯曰放勳")
# output
[{'entity': 'NB1',
'score': 0.99410427,
'index': 1,
'word': '帝',
'start': 0,
'end': 1},
{'entity': 'NB1',
'score': 0.98874336,
'index': 2,
'word': '堯',
'start': 1,
'end': 2},
{'entity': 'VG',
'score': 0.97059363,
'index': 3,
'word': '曰',
'start': 2,
'end': 3},
{'entity': 'NB1',
'score': 0.9864504,
'index': 4,
'word': '放',
'start': 3,
'end': 4},
{'entity': 'NB1',
'score': 0.9543974,
'index': 5,
'word': '勳',
'start': 4,
'end': 5}]
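Since both models label individual characters, you can combine the WS and PoS pipelines to obtain word-level tags, taking each word's tag from its first character (a minimal sketch under that assumption; ws_pos is a hypothetical helper and assumes both pipelines tokenize the input identically):
from transformers import pipeline

ws = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
pos = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")

# Segment with the WS model, then tag each word with the PoS label
# predicted for its first character
def ws_pos(text):
    words = []
    for w, p in zip(ws(text), pos(text)):
        if w['entity'] == 'B' or not words:
            words.append([w['word'], p['entity']])
        else:
            words[-1][0] += w['word']
    return [tuple(pair) for pair in words]

ws_pos("帝堯曰放勳")
# [('帝堯', 'NB1'), ('曰', 'VG'), ('放勳', 'NB1')]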
Masked language modeling results. Rows give the era of the MLM training data; columns give the era of the MLM testing data.

| Language Model | MLM Training Data | 上古 | 中古 | 近代 | 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese | 上古 | 24.7588 | 87.8176 | 297.1111 | 60.3993 |
| | 中古 | 67.861 | 70.6244 | 133.0536 | 23.0125 |
| | 近代 | 69.1364 | 77.4154 | 46.8308 | 20.4289 |
| | 現代 | 118.8596 | 163.6896 | 146.5959 | 4.6143 |
| | Merge | 31.1807 | 61.2381 | 49.0672 | 4.5017 |
| ckiplab/bert-base-chinese | - | 233.6394 | 405.9008 | 278.7069 | 8.8521 |
Word segmentation results. Rows give the era of the training data; columns give the era of the testing data.

| WS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese-ws | 上古 | 97.6090 | 88.5734 | 83.2877 | 70.3772 |
| | 中古 | 92.6402 | 92.6538 | 89.4803 | 78.3827 |
| | 近代 | 90.8651 | 92.1861 | 94.6495 | 81.2143 |
| | 現代 | 87.0234 | 83.5810 | 84.9370 | 96.9446 |
| | Merge | 97.4537 | 91.9990 | 94.0970 | 96.7314 |
| ckiplab/bert-base-chinese-ws | - | 86.5698 | 82.9115 | 84.3213 | 98.1325 |
Part-of-speech tagging results. Rows give the era of the training data; columns give the era of the testing data.

| POS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
|---|---|---|---|---|---|
| ckiplab/bert-base-han-chinese-pos | 上古 | 91.2945 | - | - | - |
| | 中古 | 7.3662 | 80.4896 | 11.3371 | 10.2577 |
| | 近代 | 6.4794 | 14.3653 | 88.6580 | 0.5316 |
| | 現代 | 11.9895 | 11.0775 | 0.4033 | 93.2813 |
| | Merge | 88.8772 | 42.4369 | 86.9093 | 92.9012 |
Copyright (c) 2022 CKIP Lab. Released under the GPL-3.0 License.
Please cite our paper if you use Han-Transformers in your work:
@inproceedings{lin-ma-2022-hantrans,
title = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
author = "Lin, Chin-Tung and Ma, Wei-Yun",
booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
year = "2022",
address = "Taipei, Taiwan",
publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
url = "https://aclanthology.org/2022.rocling-1.21",
pages = "164--173",
}