Pretrained Japanese BERT models

This is a repository of pretrained Japanese BERT models. The models are available in Transformers by Hugging Face.
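For example, a released checkpoint can be loaded directly with the transformers library. Below is a minimal sketch; it assumes the model ID cl-tohoku/bert-base-japanese-v3 on the Hugging Face Hub and requires fugashi and unidic-lite for the Japanese tokenizer.

# pip install transformers fugashi unidic-lite
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

model_name = "cl-tohoku/bert-base-japanese-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Fill-mask example: predict the masked word in a Japanese sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("東北大学で[MASK]の研究をしています。"))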

This version of the README contains information on the following models:

- bert-base-japanese-v3
- bert-base-japanese-char-v3
- bert-large-japanese-v2
- bert-large-japanese-char-v2

For information and code on the following models, refer to the v2.0 tag of this repository:

- bert-base-japanese-v2
- bert-base-japanese-char-v2
- bert-large-japanese
- bert-large-japanese-char

For information and code on the earlier models, refer to the v1.0 tag of this repository:

Model Architecture

The architecture of our models is the same as that of the original BERT models proposed by Google.
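For reference, a minimal sketch of the BERT-base configuration in Transformers terms is shown below; it assumes the standard BERT-base hyperparameters together with the 32,768-token WordPiece vocabulary trained later in this README, and is not a dump of the released config files.

# Sketch only: standard BERT-base hyperparameters with the WordPiece vocabulary
# size used in this repository (--vocab_size 32768 in the tokenizer training below).
# BERT-large uses 24 layers, 1024 hidden dimensions, and 16 attention heads.
from transformers import BertConfig

config = BertConfig(
    vocab_size=32768,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,
)
print(config)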

Training Data

The models are trained on the Japanese portion of the CC-100 dataset and the Japanese version of Wikipedia. For Wikipedia, we generated a text corpus from the Wikipedia Cirrussearch dump file as of January 2, 2023.

The corpus files generated from CC-100 and Wikipedia are 74.3GB and 4.9GB in size and consist of approximately 392M and 34M sentences, respectively.

To split texts into sentences, we used fugashi with the mecab-ipadic-NEologd dictionary (v0.0.7).
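As an illustration of MeCab-based tokenization through fugashi (not the repository's corpus-building scripts), the sketch below tokenizes a sentence. By default fugashi uses the unidic-lite dictionary when it is installed; the NEologd dictionary can be selected with a MeCab option such as the -d flag shown in the Wikipedia command below.

# Illustrative sketch only; requires `pip install fugashi unidic-lite`.
import fugashi

tagger = fugashi.Tagger()  # uses unidic-lite by default when installed
text = "自然言語処理の研究を行っています。"
print([word.surface for word in tagger(text)])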

Generating corpus files

# For CC-100
$ mkdir -p $WORK_DIR/corpus/cc-100
$ python merge_split_corpora.py \
--input_files $DATA_DIR/cc-100/ja.txt.xz \
--output_dir $WORK_DIR/corpus/cc-100 \
--num_files 64

# For Wikipedia
$ mkdir -p $WORK_DIR/corpus/wikipedia
$ python make_corpus_wiki.py \
--input_file $DATA_DIR/wikipedia/cirrussearch/20230102/jawiki-20230102-cirrussearch-content.json.gz \
--output_file $WORK_DIR/corpus/wikipedia/corpus.txt.gz \
--min_sentence_length 10 \
--max_sentence_length 200 \
--mecab_option '-r <path to etc/mecabrc> -d <path to lib/mecab/dic/mecab-ipadic-neologd>'
$ python merge_split_corpora.py \
--input_files $WORK_DIR/corpus/wikipedia/corpus.txt.gz \
--output_dir $WORK_DIR/corpus/wikipedia \
--num_files 8

# Sample 10M sentences for training tokenizers
$ cat $WORK_DIR/corpus/wikipedia/corpus_*.txt | grep -a -v '^$' | shuf | head -n 10000000 > $WORK_DIR/corpus/wikipedia/corpus_sampled.txt

Tokenization

For each of BERT-base and BERT-large, we provide two models with different tokenization methods: a WordPiece model, in which texts are tokenized into words by MeCab and then split into subwords by WordPiece, and a Character model, in which the MeCab word tokens are split into individual characters.

We used the unidic-lite dictionary for word tokenization with MeCab.
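As a quick illustration of the difference between the two tokenization methods, the following sketch compares the WordPiece and Character tokenizers; it assumes the released model IDs on the Hugging Face Hub.

from transformers import AutoTokenizer

text = "自然言語処理の研究を行っています。"

# WordPiece model: MeCab word tokenization followed by WordPiece subwords.
tok_wordpiece = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-v3")
print(tok_wordpiece.tokenize(text))

# Character model: MeCab word tokenization followed by splitting into characters.
tok_char = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese-char-v3")
print(tok_char.tokenize(text))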

Generating a set of characters

$ mkdir -p $WORK_DIR/tokenizers/alphabet
$ python make_alphabet_from_unidic.py \
--lex_file $DATA_DIR/unidic-mecab-2.1.2_src/lex.csv \
--output_file $WORK_DIR/tokenizers/alphabet/unidic_lite.txt
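
This step collects the set of characters used as the initial alphabet for the WordPiece tokenizer. An illustrative sketch of what it does (not the repository's make_alphabet_from_unidic.py; it assumes the surface form is the first column of lex.csv and that the file is UTF-8 encoded):

import csv

chars = set()
with open("lex.csv", encoding="utf-8") as f:   # e.g. $DATA_DIR/unidic-mecab-2.1.2_src/lex.csv
    for row in csv.reader(f):
        chars.update(row[0])                   # characters of the surface form
with open("unidic_lite.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(chars)) + "\n")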

Training tokenizers

# WordPiece
$ python train_tokenizer.py \
--input_files $WORK_DIR/corpus/wikipedia/corpus_sampled.txt \
--output_dir $WORK_DIR/tokenizers/wordpiece_unidic_lite \
--pre_tokenizer_type mecab \
--mecab_dic_type unidic_lite \
--vocab_size 32768 \
--limit_alphabet 7012 \
--initial_alphabet_file $WORK_DIR/tokenizers/alphabet/unidic_lite.txt \
--num_unused_tokens 10 \
--wordpieces_prefix '##'

# Character
# The first 7027 lines of the WordPiece vocabulary are the 5 special tokens,
# the 10 unused tokens, and the 7012 single characters, so taking them yields
# a character-level vocabulary.
$ mkdir $WORK_DIR/tokenizers/character_unidic_lite
$ head -n 7027 $WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt > $WORK_DIR/tokenizers/character_unidic_lite/vocab.txt
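
The trained vocabulary can then be loaded with the Transformers Japanese tokenizer. A minimal sketch (the path is a placeholder for the vocab.txt produced above; use subword_tokenizer_type="character" for the character vocabulary):

from transformers import BertJapaneseTokenizer

tokenizer = BertJapaneseTokenizer(
    vocab_file="wordpiece_unidic_lite/vocab.txt",   # placeholder path
    word_tokenizer_type="mecab",
    subword_tokenizer_type="wordpiece",
    mecab_kwargs={"mecab_dic": "unidic_lite"},
)
print(tokenizer.tokenize("日本語のテキストをトークン化します。"))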

Generating pretraining data

# WordPiece on CC-100
# Each process takes about 2h50m and 60GB RAM, producing 15.2M instances
$ mkdir -p $WORK_DIR/pretraining_data/wordpiece_unidic_lite/cc-100
$ seq -f %02g 1 64|xargs -L 1 -I {} -P 2 \
python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/cc-100/corpus_{}.txt \
--output_file $WORK_DIR/pretraining_data/wordpiece_unidic_lite/cc-100/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--use_v2_feature_names \
--max_seq_length 128 \
--max_predictions_per_seq 19 \
--masked_lm_prob 0.15 \
--dupe_factor 5

# WordPiece on Wikipedia
# Each process takes about 7h30m and 138GB RAM, producing 18.4M instances
$ mkdir -p $WORK_DIR/pretraining_data/wordpiece_unidic_lite/wikipedia
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 1 \
python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/wikipedia/corpus_{}.txt \
--output_file $WORK_DIR/pretraining_data/wordpiece_unidic_lite/wikipedia/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type wordpiece \
--mecab_dic_type unidic_lite \
--do_whole_word_mask \
--gzip_compress \
--use_v2_feature_names \
--max_seq_length 512 \
--max_predictions_per_seq 76 \
--masked_lm_prob 0.15 \
--dupe_factor 30

# Character on CC-100
# Each process takes about 3h30m and 82GB RAM, producing 18.4M instances
$ mkdir -p $WORK_DIR/pretraining_data/character_unidic_lite/cc-100
$ seq -f %02g 1 64|xargs -L 1 -I {} -P 2 \
python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/cc-100/corpus_{}.txt \
--output_file $WORK_DIR/pretraining_data/character_unidic_lite/cc-100/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/character_unidic_lite/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type character \
--mecab_dic_type unidic_lite \
--vocab_has_no_subword_prefix \
--do_whole_word_mask \
--gzip_compress \
--use_v2_feature_names \
--max_seq_length 128 \
--max_predictions_per_seq 19 \
--masked_lm_prob 0.15 \
--dupe_factor 5

# Character on Wikipedia
# Each process takes about 10h30m and 205GB RAM, producing 23.7M instances
$ mkdir -p $WORK_DIR/pretraining_data/character_unidic_lite/wikipedia
$ seq -f %02g 1 8|xargs -L 1 -I {} -P 1 \
python create_pretraining_data.py \
--input_file $WORK_DIR/corpus/wikipedia/corpus_{}.txt \
--output_file $WORK_DIR/pretraining_data/character_unidic_lite/wikipedia/pretraining_data_{}.tfrecord.gz \
--vocab_file $WORK_DIR/tokenizers/character_unidic_lite/vocab.txt \
--word_tokenizer_type mecab \
--subword_tokenizer_type character \
--mecab_dic_type unidic_lite \
--vocab_has_no_subword_prefix \
--do_whole_word_mask \
--gzip_compress \
--use_v2_feature_names \
--max_seq_length 512 \
--max_predictions_per_seq 76 \
--masked_lm_prob 0.15 \
--dupe_factor 30
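
The commands above write gzip-compressed TFRecord files. A minimal sketch for sanity-checking one of the generated shards (the path is a placeholder; the exact feature names depend on the --use_v2_feature_names flag):

import tensorflow as tf

path = "pretraining_data_01.tfrecord.gz"  # placeholder: one of the shards produced above
dataset = tf.data.TFRecordDataset(path, compression_type="GZIP")
for raw_record in dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    # Print each feature's name and length for the first instance.
    for name, feature in example.features.feature.items():
        values = feature.int64_list.value or feature.float_list.value
        print(name, len(values))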

Training

We trained the models first on the CC-100 corpus and then on the Wikipedia corpus. Generally speaking, the texts in Wikipedia are much cleaner than those in CC-100, but the amount of text is much smaller. We expect that this two-stage training scheme lets the model learn from a large amount of text while preserving the quality of the language it eventually learns.

For training with the MLM (masked language modeling) objective, we introduced whole word masking, in which all subword tokens corresponding to a single word (as tokenized by MeCab) are masked at once.
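The sketch below illustrates the whole word masking idea only; it is not the repository's create_pretraining_data.py. Subword pieces prefixed with "##" are grouped with the preceding token, and each group is masked as a unit.

import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Group each token with its following "##" continuation pieces.
    words, current = [], []
    for token in tokens:
        if token.startswith("##") and current:
            current.append(token)
        else:
            if current:
                words.append(current)
            current = [token]
    if current:
        words.append(current)
    # Mask whole groups, i.e. every subword piece of a selected word.
    masked = []
    for word in words:
        if random.random() < mask_prob:
            masked.extend([mask_token] * len(word))
        else:
            masked.extend(word)
    return masked

print(whole_word_mask(["言語", "##モデル", "を", "事前", "##学習", "する"]))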

To train each model, we used a v3-8 Cloud TPU instance provided by the TPU Research Cloud program. Training took about 16 days for the BERT-base models and about 56 days for the BERT-large models.

Creating a TPU VM and connecting to it

Note: We set the TPU runtime version to 2.11.0, which uses TensorFlow v2.11. If you wish to reuse our code, it is important to specify the same version; otherwise it may not work properly.

Here we use the Google Cloud CLI (gcloud).

$ gcloud compute tpus tpu-vm create <TPU_NODE_ID> --zone=<TPU_ZONE> --accelerator-type=v3-8 --version=tpu-vm-tf-2.11.0
$ gcloud compute tpus tpu-vm ssh <TPU_NODE_ID> --zone=<TPU_ZONE>

Training of the models

The following commands are executed in the TPU VM. It is recommended that you run the commands in a tmux session.

Note: All the necessary files (i.e., pretraining data and config files) need to be stored in a Google Cloud Storage (GCS) bucket in advance.

BERT-base, WordPiece

(vm)$ cd /usr/share/tpu/models/
(vm)$ pip3 install -r official/requirements.txt
(vm)$ export PYTHONPATH=/usr/share/tpu/models
(vm)$ CONFIG_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/configs"
(vm)$ DATA_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/pretraining_data/wordpiece_unidic_lite"
(vm)$ MODEL_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/model/wordpiece_unidic_lite"

# Start training on CC-100
# It will take 6 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_base/training/cc-100 \
--config_file=$CONFIG_DIR/data/cc-100.yaml \
--config_file=$CONFIG_DIR/model/bert_base_wordpiece.yaml \
--params_override="task.train_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

# Continue training on Wikipedia
# It will take 10 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_base/training/cc-100_wikipedia \
--config_file=$CONFIG_DIR/data/wikipedia.yaml \
--config_file=$CONFIG_DIR/model/bert_base_wordpiece.yaml \
--params_override="task.init_checkpoint=$MODEL_DIR/bert_base/training/cc-100,task.train_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

BERT-base, Character

(vm)$ cd /usr/share/tpu/models/
(vm)$ pip3 install -r official/requirements.txt
(vm)$ export PYTHONPATH=/usr/share/tpu/models
(vm)$ CONFIG_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/configs"
(vm)$ DATA_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/pretraining_data/character_unidic_lite"
(vm)$ MODEL_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/model/character_unidic_lite"

# Start training on CC-100
# It will take 6 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_base/training/cc-100 \
--config_file=$CONFIG_DIR/data/cc-100.yaml \
--config_file=$CONFIG_DIR/model/bert_base_character.yaml \
--params_override="task.train_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

# Continue training on Wikipedia
# It will take 10 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_base/training/cc-100_wikipedia \
--config_file=$CONFIG_DIR/data/wikipedia.yaml \
--config_file=$CONFIG_DIR/model/bert_base_character.yaml \
--params_override="task.init_checkpoint=$MODEL_DIR/bert_base/training/cc-100,task.train_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

BERT-large, WordPiece

(vm)$ cd /usr/share/tpu/models/
(vm)$ pip3 install -r official/requirements.txt
(vm)$ export PYTHONPATH=/usr/share/tpu/models
(vm)$ CONFIG_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/configs"
(vm)$ DATA_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/pretraining_data/wordpiece_unidic_lite"
(vm)$ MODEL_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/model/wordpiece_unidic_lite"

# Start training on CC-100
# It will take 23 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_large/training/cc-100 \
--config_file=$CONFIG_DIR/data/cc-100.yaml \
--config_file=$CONFIG_DIR/model/bert_large_wordpiece.yaml \
--params_override="task.train_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

# Continue training on Wikipedia
# It will take 33 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_large/training/cc-100_wikipedia \
--config_file=$CONFIG_DIR/data/wikipedia.yaml \
--config_file=$CONFIG_DIR/model/bert_large_wordpiece.yaml \
--params_override="task.init_checkpoint=$MODEL_DIR/bert_large/training/cc-100,task.train_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

BERT-large, Character

(vm)$ cd /usr/share/tpu/models/
(vm)$ pip3 install -r official/requirements.txt
(vm)$ export PYTHONPATH=/usr/share/tpu/models
(vm)$ CONFIG_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/configs"
(vm)$ DATA_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/pretraining_data/character_unidic_lite"
(vm)$ MODEL_DIR="gs://<GCS_BUCKET_ID>/bert-japanese/model/character_unidic_lite"

# Start training on CC-100
# It will take 23 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_large/training/cc-100 \
--config_file=$CONFIG_DIR/data/cc-100.yaml \
--config_file=$CONFIG_DIR/model/bert_large_character.yaml \
--params_override="task.train_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/cc-100/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

# Continue training on Wikipedia
# It will take 33 days to finish on a v3-8 TPU
(vm)$ python3 official/nlp/train.py \
--tpu=local \
--experiment=bert/pretraining \
--mode=train_and_eval \
--model_dir=$MODEL_DIR/bert_large/training/cc-100_wikipedia \
--config_file=$CONFIG_DIR/data/wikipedia.yaml \
--config_file=$CONFIG_DIR/model/bert_large_character.yaml \
--params_override="task.init_checkpoint=$MODEL_DIR/bert_large/training/cc-100,task.train_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,task.validation_data.input_path=$DATA_DIR/wikipedia/pretraining_data_*.tfrecord,runtime.distribution_strategy=tpu"

Deleting a TPU VM

$ gcloud compute tpus tpu-vm delete <TPU_NODE_ID> --zone=<TPU_ZONE>

Model Conversion

You can convert the TensorFlow model checkpoint to a PyTorch model file.

Note: The model conversion script is designed for the models trained with TensorFlow v2.11.0. The script may not work for models trained with a different version of TensorFlow.

# For BERT-base, WordPiece
$ VOCAB_FILE=$WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt
$ TF_CONFIG_FILE=model_configs/bert_base_wordpiece/config.json
$ HF_CONFIG_DIR=hf_model_configs/bert_base_wordpiece
$ TF_CKPT_PATH=$WORK_DIR/model/wordpiece_unidic_lite/bert_base/training/cc-100_wikipedia/ckpt-1000000
$ OUTPUT_DIR=$WORK_DIR/hf_model/bert-base-japanese-v3

# For BERT-base, Character
$ VOCAB_FILE=$WORK_DIR/tokenizers/character_unidic_lite/vocab.txt
$ TF_CONFIG_FILE=model_configs/bert_base_character/config.json
$ HF_CONFIG_DIR=hf_model_configs/bert_base_character
$ TF_CKPT_PATH=$WORK_DIR/model/character_unidic_lite/bert_base/training/cc-100_wikipedia/ckpt-1000000
$ OUTPUT_DIR=$WORK_DIR/hf_model/bert-base-japanese-char-v3

# For BERT-large, WordPiece
$ VOCAB_FILE=$WORK_DIR/tokenizers/wordpiece_unidic_lite/vocab.txt
$ TF_CONFIG_FILE=model_configs/bert_large_wordpiece/config.json
$ HF_CONFIG_DIR=hf_model_configs/bert_large_wordpiece
$ TF_CKPT_PATH=$WORK_DIR/model/wordpiece_unidic_lite/bert_large/training/cc-100_wikipedia/ckpt-1000000
$ OUTPUT_DIR=$WORK_DIR/hf_model/bert-large-japanese-v2

# For BERT-large, Character
$ VOCAB_FILE=$WORK_DIR/tokenizers/character_unidic_lite/vocab.txt
$ TF_CONFIG_FILE=model_configs/bert_large_character/config.json
$ HF_CONFIG_DIR=hf_model_configs/bert_large_character
$ TF_CKPT_PATH=$WORK_DIR/model/character_unidic_lite/bert_large/training/cc-100_wikipedia/ckpt-1000000
$ OUTPUT_DIR=$WORK_DIR/hf_model/bert-large-japanese-char-v2

# Run the model conversion script
$ mkdir -p $OUTPUT_DIR
$ python convert_tf2_checkpoint_to_all_frameworks.py \
--tf_checkpoint_path $TF_CKPT_PATH \
--tf_config_file $TF_CONFIG_FILE \
--output_path $OUTPUT_DIR
$ cp $HF_CONFIG_DIR/* $OUTPUT_DIR
$ cp $VOCAB_FILE $OUTPUT_DIR
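
To check that the conversion succeeded, the resulting directory can be loaded with Transformers. A minimal sketch, assuming the copied Hugging Face config files include a tokenizer_config.json, with the path standing in for $OUTPUT_DIR:

from transformers import AutoModelForMaskedLM, AutoTokenizer

model_dir = "/path/to/hf_model/bert-base-japanese-v3"  # placeholder for $OUTPUT_DIR
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir)
print(model.config)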

Model Performances

We evaluated our models' performances on the JGLUE benchmark tasks.

For each task, the model is fine-tuned on the training set and evaluated on the development set (the test sets are not publicly available as of this writing). The hyperparameters were searched within the same set of values as those specified in the JGLUE fine-tuning README. A minimal illustrative fine-tuning sketch is shown after the results table below.

The results of our (informal) experiments are shown below. Note that these results should be viewed as informative only, since each setting was run with a single fixed random seed.

Model                              | MARC-ja (Acc.) | JSTS (Pearson/Spearman) | JNLI (Acc.) | JSQuAD (EM/F1) | JCommonsenseQA (Acc.)
bert-base-japanese-v2              | 0.958          | 0.910 / 0.871           | 0.901       | 0.869 / 0.939  | 0.803
bert-base-japanese-v3 (New!)       | 0.962          | 0.919 / 0.881           | 0.907       | 0.880 / 0.946  | 0.848
bert-base-japanese-char-v2         | 0.957          | 0.891 / 0.851           | 0.896       | 0.870 / 0.938  | 0.724
bert-base-japanese-char-v3 (New!)  | 0.959          | 0.914 / 0.875           | 0.903       | 0.871 / 0.939  | 0.786
bert-large-japanese                | 0.958          | 0.913 / 0.874           | 0.902       | 0.881 / 0.946  | 0.823
bert-large-japanese-v2 (New!)      | 0.960          | 0.926 / 0.893           | 0.929       | 0.893 / 0.956  | 0.893
bert-large-japanese-char           | 0.958          | 0.883 / 0.842           | 0.899       | 0.870 / 0.938  | 0.753
bert-large-japanese-char-v2 (New!) | 0.961          | 0.921 / 0.884           | 0.910       | 0.892 / 0.952  | 0.859
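
Fine-tuning itself follows the standard Transformers sequence-classification recipe. The sketch below is illustrative only and is not the JGLUE baseline scripts; the toy sentence pairs, label count (3, as in JNLI), and hyperparameters are placeholders.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "cl-tohoku/bert-base-japanese-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Toy sentence pairs standing in for the real JNLI training data.
train = Dataset.from_dict({
    "sentence1": ["男性がギターを弾いている。", "犬が公園を走っている。"],
    "sentence2": ["人が楽器を演奏している。", "猫が眠っている。"],
    "label": [0, 2],
})

def preprocess(batch):
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True, max_length=128)

train = train.map(preprocess, batched=True)
args = TrainingArguments(output_dir="out", per_device_train_batch_size=2,
                         num_train_epochs=1, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer).train()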

Licenses

The pretrained models and the code in this repository are distributed under the Apache License 2.0.

Related Work

Acknowledgments

The distributed models were trained with Cloud TPUs provided by the TPU Research Cloud program.