zhconv-rs converts Chinese text among traditional/simplified scripts and regional variants (e.g. zh-TW <-> zh-CN <-> zh-HK <-> zh-Hans <-> zh-Hant), built on top of rulesets from MediaWiki/Wikipedia and OpenCC.
The implementation is powered by the Aho-Corasick algorithm, which guarantees linear time complexity with respect to the combined length of the input text and the conversion rules (O(n+m)), processing dozens of MiBs of text per second.
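For instance, here is a minimal sketch of calling the Rust crate; the top-level zhconv function and Variant enum follow the crate docs, while the exact output shown is only illustrative:

```rust
// Minimal usage sketch of the zhconv crate (see the crate docs for
// authoritative examples); the expected output is illustrative.
use zhconv::{zhconv, Variant};

fn main() {
    // Convert simplified Chinese to the Taiwan (traditional) variant.
    let converted = zhconv("天干物燥，小心火烛", Variant::ZhTW);
    println!("{converted}"); // e.g. 天乾物燥，小心火燭
}
```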
🔗 Web App: https://zhconv.pages.dev (powered by WASM)
⚙️ CLI: cargo install zhconv-cli, or check releases
🦀 Rust Crate: cargo add zhconv (check docs for examples)
🐍 Python Package via PyO3: pip install zhconv-rs (WASM with wheels)
JS (Webpack): npm install zhconv or yarn add zhconv (WASM, instructions)
JS in browser: https://cdn.jsdelivr.net/npm/zhconv-web@latest (WASM)
Benchmark results from cargo bench on an Intel(R) Xeon(R) CPU @ 2.80GHz (GitPod), as of v0.2, without parsing inline conversion rules:
load zh2Hant time: [45.442 ms 45.946 ms 46.459 ms]
load zh2Hans time: [8.1378 ms 8.3787 ms 8.6414 ms]
load zh2TW time: [60.209 ms 61.261 ms 62.407 ms]
load zh2HK time: [89.457 ms 90.847 ms 92.297 ms]
load zh2MO time: [96.670 ms 98.063 ms 99.586 ms]
load zh2CN time: [27.850 ms 28.520 ms 29.240 ms]
load zh2SG time: [28.175 ms 28.963 ms 29.796 ms]
load zh2MY time: [27.142 ms 27.635 ms 28.143 ms]
zh2TW data54k time: [546.10 us 553.14 us 561.24 us]
zh2CN data54k time: [504.34 us 511.22 us 518.59 us]
zh2Hant data689k time: [3.4375 ms 3.5182 ms 3.6013 ms]
zh2TW data689k time: [3.6062 ms 3.6784 ms 3.7545 ms]
zh2Hant data3185k time: [62.457 ms 64.257 ms 66.099 ms]
zh2TW data3185k time: [60.217 ms 61.348 ms 62.556 ms]
zh2TW data55m time: [1.0773 s 1.0872 s 1.0976 s]
The benchmark above was run on an earlier version that shipped only MediaWiki rulesets; with OpenCC rulesets activated by default, performance may degrade by roughly 2x. Since v0.3, however, the Aho-Corasick implementation has been switched to daachorse, with the automata prebuilt at compile time, so performance is no worse than in the previous version even though the OpenCC rulesets are newly included.
Note that OpenCC rulesets account for at least several MiBs in the build output. If that looks too big, override the default features (e.g. zhconv = { version = "...", default-features = false, features = [ "compress" ] }).
ZhConver{sion,ter}.php of MediaWiki: zhconv-rs just takes the conversion tables listed in ZhConversion.php. MediaWiki relies on the inefficient PHP built-in function strtr. Under the basic mode, zhconv-rs guarantees linear time complexity (T = O(n+m) instead of O(nm)) and single-pass scanning of the input text. Optionally, zhconv-rs supports the same conversion rule syntax as MediaWiki.
OpenCC: zhconv-rs also draws its rulesets partly from the OpenCC project. However, OpenCC supports pre-segmentation and maintains multiple rulesets which are applied successively. By contrast, the Aho-Corasick-powered zhconv-rs merges the rulesets from MediaWiki and OpenCC at compile time and converts text in a single linear-time pass, which is much more efficient, although conversion results may differ in some cases.
The converter takes a leftmost-longest matching strategy: it gives priority to the match that starts earliest in the text and, among matches starting at the same position, to the longest one. For instance, if a ruleset includes both 干 -> 幹 and 天干物燥 -> 天乾物燥, the converter outputs 天乾物燥, because the match for 天干物燥 starts earlier than the match for 干, which begins at a later position. The strategy yields good results in general, but may occasionally lead to wrong conversions.
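To illustrate the strategy itself (not zhconv-rs's internal daachorse-based automata), the aho-corasick crate exposes the same leftmost-longest semantics; the patterns and replacements below mirror the example above:

```rust
use aho_corasick::{AhoCorasick, MatchKind};

fn main() {
    // Two rules: a single-character rule and a longer phrase rule.
    let patterns = ["干", "天干物燥"];
    let replacements = ["幹", "天乾物燥"];

    let ac = AhoCorasick::builder()
        .match_kind(MatchKind::LeftmostLongest) // earliest match wins; ties broken by length
        .build(patterns)
        .unwrap();

    // The phrase match starts earlier than the lone 干, so the phrase rule applies.
    assert_eq!(
        ac.replace_all("天干物燥，小心火烛", &replacements),
        "天乾物燥，小心火烛"
    );
}
```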
The implementation supports most of the MediaWiki conversion rules, but it is not fully compliant with the original implementation.
Besides, for wikitext whose input contains global conversion rules (in MediaWiki syntax such as -{H|zh-hans:鹿|zh-hant:马}-), the time complexity may degrade to O(nm) in the worst case (equivalent to brute force), where n is the length of the text and m is the maximum length of the source words in the conversion rulesets.
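For example, a rough sketch of converting text that carries an inline MediaWiki conversion rule; the zhconv_mw helper (which activates rules written in MediaWiki syntax) is assumed from the crate docs, and the rule and output here are purely illustrative:

```rust
// Sketch only: `zhconv_mw` is assumed to parse and apply inline MediaWiki
// conversion rules before converting the remaining text (see the crate docs).
use zhconv::{zhconv_mw, Variant};

fn main() {
    // An inline rule forces a specific term for each target variant.
    let wikitext = "-{zh-hans:计算机; zh-hant:電腦}-是现代生活的工具";
    let converted = zhconv_mw(wikitext, Variant::ZhHant);
    println!("{converted}"); // illustrative output: 電腦是現代生活的工具
}
```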
All rulesets that power the converter come from the MediaWiki project and OpenCC.
The project takes the following projects/pages as references:
ZhConver{sion,ter}.php.