OFA-Sys / OFA

Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
Apache License 2.0

Any experience obtaining the vocabulary for a specific language (e.g., Chinese)? #380

Open yangbang18 opened 1 year ago

yangbang18 commented 1 year ago

Hi, I saw that you shared a Chinese version of OFA that uses a vocabulary of roughly 20K WordPieces.

I wonder how you obtained this vocabulary.

Did you extract it from the existing vocabulary of a pre-trained language model (e.g., bert-base-multilingual-cased)?

Or did you build it from scratch on a large-scale Chinese corpus?
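To make the first option concrete, here is a minimal sketch of what I mean by extracting a Chinese subset from an existing WordPiece vocabulary. This is only an illustration, not necessarily how the Chinese OFA vocabulary was built. The names `extract_chinese_vocab`, `SPECIAL_TOKENS`, and `toy_vocab` are hypothetical, and the sketch assumes that "Chinese" means characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF), with the BERT-style `##` continuation prefix.

```python
# Hypothetical helper names; the toy vocabulary below stands in for a real vocab file.
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}


def is_cjk_char(ch: str) -> bool:
    """Return True if ch lies in the CJK Unified Ideographs block (U+4E00..U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def extract_chinese_vocab(tokens):
    """Keep special tokens and tokens made up only of CJK ideographs.

    A leading '##' (WordPiece continuation marker) is ignored when checking
    the characters, but the token itself is kept unchanged.
    """
    kept = []
    for tok in tokens:
        if tok in SPECIAL_TOKENS:
            kept.append(tok)
            continue
        core = tok[2:] if tok.startswith("##") else tok
        # Skip empty cores (e.g. a bare '##') and anything with non-CJK characters.
        if core and all(is_cjk_char(c) for c in core):
            kept.append(tok)
    return kept


# Toy stand-in for a multilingual vocabulary (assumption: one token per entry).
toy_vocab = ["[PAD]", "[UNK]", "the", "##ing", "中", "##国", "猫", "été", "##"]
chinese_vocab = extract_chinese_vocab(toy_vocab)
print(chinese_vocab)
```

With a real model, the token list would come from the tokenizer's vocabulary file (one token per line) instead of `toy_vocab`. Note that this filter also drops digits, punctuation, and Latin tokens, so a usable vocabulary would probably need some of those added back.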

Looking forward to your reply.