Closed: tomohideshibata closed this issue 2 years ago
Hi there, sorry for the very long delay. I finally got around to looking at this. Thanks for taking the time to re-run the experiments and for pointing this out.
I can definitely confirm your results, so it seems the subword-based model is the better choice of the two after all. Unfortunately, I did not account for the conversion from character level to word level in the NER data, which of course should be done before applying the subword-tokenized model to have a really fair comparison, both between the two Japanese models and against mBERT. In hindsight, choosing the character-based model solely on preliminary NER runs without making that conversion was not ideal. Note that mBERT tokenizes Kanji at the character level (but not Hiragana or Katakana), so the conversion should affect mBERT's performance less than that of the subword-tokenized Japanese model.
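For reference, this behaviour is easy to inspect directly. The snippet below is purely illustrative (it assumes the `transformers` library is installed and downloads the mBERT tokenizer); it is not part of the paper's pipeline:

```python
# Illustrative check of mBERT's handling of Japanese scripts.
# BERT's basic tokenizer inserts whitespace around CJK ideographs (Kanji),
# so Kanji end up as single-character tokens, while Hiragana/Katakana are
# left for WordPiece and can remain multi-character subwords.
from transformers import AutoTokenizer

mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(mbert.tokenize("東京でラーメンを食べた"))  # example sentence, output will vary
```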
For what it's worth, which of the two Japanese models we use in the experiments doesn't seem to change the conclusions we draw. The subword-based tokenizer yields very low fertility and proportion of continued words (it generally seems to be a more effective tokenizer than both the character-based one and mBERT's), and it leads to better overall performance when trained on the very same data, which corroborates the conclusion that the tokenizer should be well-chosen :).
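As a side note, these two statistics can be approximated in a few lines; the sketch below is only a rough illustration (the toy corpus is a placeholder, and the Japanese tokenizers additionally require `fugashi`/`ipadic` to be installed):

```python
# Rough sketch: fertility (subwords per word) and proportion of continued
# words (words split into more than one subword) over word-segmented text.
from transformers import AutoTokenizer

def tokenizer_stats(tokenizer, sentences):
    n_words = n_subwords = n_continued = 0
    for words in sentences:              # each sentence is a list of words
        for word in words:
            pieces = tokenizer.tokenize(word)
            n_words += 1
            n_subwords += len(pieces)
            if len(pieces) > 1:          # word was split into several subwords
                n_continued += 1
    return n_subwords / n_words, n_continued / n_words

ja_subword = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
toy_corpus = [["東京", "で", "ラーメン", "を", "食べ", "た"]]  # stand-in data
print(tokenizer_stats(ja_subword, toy_corpus))
```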
In Section A.1, the authors say that "we select the character-tokenized Japanese BERT model because it achieved considerably higher scores on preliminary NER fine-tuning evaluations", but in my experience, a subword-tokenized model is consistently better than a character-based model.
I could reproduce the above result for NER, but the Japanese portion of WikiAnn is character-based, so when we use a subword-tokenized model, the dataset first has to be converted to word level.
(I will perform this conversion, and test the subword-tokenized model later.)
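For concreteness, here is a minimal sketch of what such a character-to-word conversion could look like. The `segment` function and the tag-merging heuristic are placeholders of my own, not the actual conversion used for WikiAnn:

```python
# Illustrative sketch only: project character-level BIO tags (as in the
# Japanese portion of WikiAnn) onto word-level tokens. `segment` stands for
# a hypothetical word segmenter (e.g. a MeCab wrapper); the heuristic of
# taking the first character's tag assumes entities align with word boundaries.
from typing import Callable, List, Tuple

def chars_to_words(chars: List[str], tags: List[str],
                   segment: Callable[[str], List[str]]) -> Tuple[List[str], List[str]]:
    words = segment("".join(chars))       # e.g. ["東京", "都", ...]
    word_tags = []
    i = 0
    for word in words:
        span = tags[i:i + len(word)]      # tags of this word's characters
        word_tags.append(span[0])         # B-X / I-X / O of the first character
        i += len(word)
    return words, word_tags
```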
All the other datasets are word-based. I have tested the character-tokenized model `cl-tohoku/bert-base-japanese-char`, which is used in the paper, and the subword-tokenized model `cl-tohoku/bert-base-japanese` (with only one seed, `seed = 1`). We can see that the subword-tokenized model is consistently better than the character-based model. It would be great if you could confirm this result.