Han Transformers

This project provides ancient Chinese models for NLP tasks, including language modeling, word segmentation, and part-of-speech tagging.

Our paper has been accepted to ROCLING 2022! Please check it out: https://aclanthology.org/2022.rocling-1.21

Dependencies

- transformers==4.15.0
- torch==1.10.2

Models

We uploaded our models to the HuggingFace hub: ckiplab/bert-base-han-chinese (pre-trained language model), ckiplab/bert-base-han-chinese-ws (word segmentation), and ckiplab/bert-base-han-chinese-pos (part-of-speech tagging).

Training Corpus

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

Usage

Installation

```
pip install transformers==4.15.0
pip install torch==1.10.2
```

Inference
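
The pre-trained language model can be queried through the standard transformers pipeline API. Below is a minimal sketch of masked-character prediction; the classical sentence is illustrative, not taken from the original examples.

```python
from transformers import pipeline

# Load the pre-trained Han-Chinese language model from the HuggingFace hub.
mlm = pipeline("fill-mask", model="ckiplab/bert-base-han-chinese")

# Predict the masked character; [MASK] is BERT's standard mask token.
for prediction in mlm("天行[MASK]，君子以自強不息。"):
    print(prediction["token_str"], prediction["score"])
```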

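The word-segmentation and POS models are token classifiers: each input character receives a label. A minimal sketch, again with an illustrative sentence; the exact tag inventories are defined by the model configurations on the hub.

```python
from transformers import pipeline

sentence = "天行健君子以自強不息"

# Word segmentation: the model tags each character with a boundary label.
ws = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
print(ws(sentence))

# Part-of-speech tagging: the model assigns a POS label to each token.
pos = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")
print(pos(sentence))
```
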
Model Performance

Pre-trained Language Model, Perplexity ↓

In the tables below, rows give the era of the training data and columns the era of the testing data: 上古 (Old Chinese), 中古 (Middle Chinese), 近代 (Early Modern Chinese), 現代 (Modern Chinese).

| Language Model | MLM Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese | 上古 | 24.7588 | 87.8176 | 297.1111 | 60.3993 |
| | 中古 | 67.861 | 70.6244 | 133.0536 | 23.0125 |
| | 近代 | 69.1364 | 77.4154 | 46.8308 | 20.4289 |
| | 現代 | 118.8596 | 163.6896 | 146.5959 | 4.6143 |
| | Merge | 31.1807 | 61.2381 | 49.0672 | 4.5017 |
| ckiplab/bert-base-chinese | - | 233.6394 | 405.9008 | 278.7069 | 8.8521 |

Word Segmentation (WS), F1 score (%) ↑

| WS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-ws | 上古 | 97.6090 | 88.5734 | 83.2877 | 70.3772 |
| | 中古 | 92.6402 | 92.6538 | 89.4803 | 78.3827 |
| | 近代 | 90.8651 | 92.1861 | 94.6495 | 81.2143 |
| | 現代 | 87.0234 | 83.5810 | 84.9370 | 96.9446 |
| | Merge | 97.4537 | 91.9990 | 94.0970 | 96.7314 |
| ckiplab/bert-base-chinese-ws | - | 86.5698 | 82.9115 | 84.3213 | 98.1325 |

Part-of-Speech (POS) Tagging, F1 score (%) ↑

| POS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-pos | 上古 | 91.2945 | - | - | - |
| | 中古 | 7.3662 | 80.4896 | 11.3371 | 10.2577 |
| | 近代 | 6.4794 | 14.3653 | 88.6580 | 0.5316 |
| | 現代 | 11.9895 | 11.0775 | 0.4033 | 93.2813 |
| | Merge | 88.8772 | 42.4369 | 86.9093 | 92.9012 |

License

Copyright (c) 2022 CKIP Lab under the GPL-3.0 License.

Citation

Please cite our paper if you use Han-Transformers in your work:

@inproceedings{lin-ma-2022-hantrans,
    title = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
    author = "Lin, Chin-Tung  and  Ma, Wei-Yun",
    booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
    year = "2022",
    address = "Taipei, Taiwan",
    publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
    url = "https://aclanthology.org/2022.rocling-1.21",
    pages = "164--173",
}