Han Transformers

This project provides ancient Chinese models for NLP tasks, including language modeling, word segmentation, and part-of-speech tagging.

Our paper has been accepted to ROCLING 2022! Please check it out: https://aclanthology.org/2022.rocling-1.21

Dependencies

- transformers==4.15.0
- torch==1.10.2

Models

We uploaded our models to the HuggingFace hub: ckiplab/bert-base-han-chinese (pre-trained language model), ckiplab/bert-base-han-chinese-ws (word segmentation), and ckiplab/bert-base-han-chinese-pos (part-of-speech tagging).

Training Corpus

The copyright of the datasets belongs to the Institute of Linguistics, Academia Sinica.

Usage

Installation

```
pip install transformers==4.15.0
pip install torch==1.10.2
```

Inference
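
The pre-trained language model can be queried through the standard transformers pipeline API. Below is a minimal sketch of masked-character prediction; the classical sentence is illustrative, not taken from the original examples.

```python
from transformers import pipeline

# Load the pre-trained Han-Chinese language model from the HuggingFace hub.
mlm = pipeline("fill-mask", model="ckiplab/bert-base-han-chinese")

# Predict the masked character; [MASK] is BERT's standard mask token.
for prediction in mlm("天行[MASK]，君子以自強不息。"):
    print(prediction["token_str"], prediction["score"])
```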

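The word-segmentation and POS models are token classifiers: each input character receives a label. A minimal sketch, again with an illustrative sentence; the exact tag inventories are defined by the model configurations on the hub.

```python
from transformers import pipeline

sentence = "天行健君子以自強不息"

# Word segmentation: the model tags each character with a boundary label.
ws = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-ws")
print(ws(sentence))

# Part-of-speech tagging: the model assigns a POS label to each token.
pos = pipeline("token-classification", model="ckiplab/bert-base-han-chinese-pos")
print(pos(sentence))
```
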
Model Performance

Pre-trained Language Model, Perplexity ↓

In the tables below, rows give the era of the training data and columns the era of the testing data: 上古 (Old Chinese), 中古 (Middle Chinese), 近代 (Early Modern Chinese), 現代 (Modern Chinese).

| Language Model | MLM Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese | 上古 | 24.7588 | 87.8176 | 297.1111 | 60.3993 |
| | 中古 | 67.861 | 70.6244 | 133.0536 | 23.0125 |
| | 近代 | 69.1364 | 77.4154 | 46.8308 | 20.4289 |
| | 現代 | 118.8596 | 163.6896 | 146.5959 | 4.6143 |
| | Merge | 31.1807 | 61.2381 | 49.0672 | 4.5017 |
| ckiplab/bert-base-chinese | - | 233.6394 | 405.9008 | 278.7069 | 8.8521 |

Word Segmentation (WS), F1 score (%) ↑

| WS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-ws | 上古 | 97.6090 | 88.5734 | 83.2877 | 70.3772 |
| | 中古 | 92.6402 | 92.6538 | 89.4803 | 78.3827 |
| | 近代 | 90.8651 | 92.1861 | 94.6495 | 81.2143 |
| | 現代 | 87.0234 | 83.5810 | 84.9370 | 96.9446 |
| | Merge | 97.4537 | 91.9990 | 94.0970 | 96.7314 |
| ckiplab/bert-base-chinese-ws | - | 86.5698 | 82.9115 | 84.3213 | 98.1325 |

Part-of-Speech (POS) Tagging, F1 score (%) ↑

| POS Model | Training Data | 上古 | 中古 | 近代 | 現代 |
| --- | --- | --- | --- | --- | --- |
| ckiplab/bert-base-han-chinese-pos | 上古 | 91.2945 | - | - | - |
| | 中古 | 7.3662 | 80.4896 | 11.3371 | 10.2577 |
| | 近代 | 6.4794 | 14.3653 | 88.6580 | 0.5316 |
| | 現代 | 11.9895 | 11.0775 | 0.4033 | 93.2813 |
| | Merge | 88.8772 | 42.4369 | 86.9093 | 92.9012 |

License

Copyright (c) 2022 CKIP Lab under the GPL-3.0 License.

Citation

Please cite our paper if you use Han-Transformers in your work:

@inproceedings{lin-ma-2022-hantrans,
    title = "{H}an{T}rans: An Empirical Study on Cross-Era Transferability of {C}hinese Pre-trained Language Model",
    author = "Lin, Chin-Tung  and  Ma, Wei-Yun",
    booktitle = "Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)",
    year = "2022",
    address = "Taipei, Taiwan",
    publisher = "The Association for Computational Linguistics and Chinese Language Processing (ACLCLP)",
    url = "https://aclanthology.org/2022.rocling-1.21",
    pages = "164--173",
}