Closed: tomohideshibata closed this issue 2 years ago
Hi there, sorry for the very long delay. I finally got around to looking at this. Thanks for taking the time to re-run the experiments and for pointing this out.
I can definitely confirm your results, so it seems the subword-based model is the better choice of the two after all. Unfortunately, I did not account for the conversion from character level to word level in the NER data, which of course should be done before applying the subword-tokenized model to have a really fair comparison, both between the two Japanese models and against mBERT. In hindsight, choosing the character-based model solely on preliminary NER runs without making that conversion was not ideal. Note that mBERT tokenizes Kanji at the character level (but not Hiragana or Katakana), so the conversion should affect mBERT's performance less than that of the subword-tokenized Japanese model.
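For reference, this behaviour is easy to inspect directly. The snippet below is purely illustrative (it assumes the `transformers` library is installed and downloads the mBERT tokenizer); it is not part of the paper's pipeline:

```python
# Illustrative check of mBERT's handling of Japanese scripts.
# BERT's basic tokenizer inserts whitespace around CJK ideographs (Kanji),
# so Kanji end up as single-character tokens, while Hiragana/Katakana are
# left for WordPiece and can remain multi-character subwords.
from transformers import AutoTokenizer

mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(mbert.tokenize("東京でラーメンを食べた"))  # example sentence, output will vary
```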
For what it's worth, which of the two Japanese models we use in the experiments doesn't seem to change the conclusions we draw. The subword-based tokenizer yields very low fertility and proportion of continued words (it generally seems to be a more effective tokenizer than both the character-based one and mBERT's), and it leads to better overall performance when trained on the very same data, which corroborates the conclusion that the tokenizer should be well-chosen :).
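As a side note, these two statistics can be approximated in a few lines; the sketch below is only a rough illustration (the toy corpus is a placeholder, and the Japanese tokenizers additionally require `fugashi`/`ipadic` to be installed):

```python
# Rough sketch: fertility (subwords per word) and proportion of continued
# words (words split into more than one subword) over word-segmented text.
from transformers import AutoTokenizer

def tokenizer_stats(tokenizer, sentences):
    n_words = n_subwords = n_continued = 0
    for words in sentences:              # each sentence is a list of words
        for word in words:
            pieces = tokenizer.tokenize(word)
            n_words += 1
            n_subwords += len(pieces)
            if len(pieces) > 1:          # word was split into several subwords
                n_continued += 1
    return n_subwords / n_words, n_continued / n_words

ja_subword = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
toy_corpus = [["東京", "で", "ラーメン", "を", "食べ", "た"]]  # stand-in data
print(tokenizer_stats(ja_subword, toy_corpus))
```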
In Section A.1, the authors say that "we select the character-tokenized Japanese BERT model because it achieved considerably higher scores on preliminary NER fine-tuning evaluations", but in my experience, a subword-tokenized model is consistently better than a character-based model.
I could reproduce the above result for NER, but the Japanese portion of WikiAnn is character-based, so when we use a subword-tokenized model, the dataset first has to be converted to word level.
(I will perform this conversion, and test the subword-tokenized model later.)
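For concreteness, here is a minimal sketch of what such a character-to-word conversion could look like. The `segment` function and the tag-merging heuristic are placeholders of my own, not the actual conversion used for WikiAnn:

```python
# Illustrative sketch only: project character-level BIO tags (as in the
# Japanese portion of WikiAnn) onto word-level tokens. `segment` stands for
# a hypothetical word segmenter (e.g. a MeCab wrapper); the heuristic of
# taking the first character's tag assumes entities align with word boundaries.
from typing import Callable, List, Tuple

def chars_to_words(chars: List[str], tags: List[str],
                   segment: Callable[[str], List[str]]) -> Tuple[List[str], List[str]]:
    words = segment("".join(chars))       # e.g. ["東京", "都", ...]
    word_tags = []
    i = 0
    for word in words:
        span = tags[i:i + len(word)]      # tags of this word's characters
        word_tags.append(span[0])         # B-X / I-X / O of the first character
        i += len(word)
    return words, word_tags
```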
All the other datasets are word-based. I have tested the character-tokenized model `cl-tohoku/bert-base-japanese-char`, which is used in the paper, and the subword-tokenized model `cl-tohoku/bert-base-japanese` (with only one seed, `seed = 1`). We can see that the subword-tokenized model is consistently better than the character-based model. It would be great if you could confirm this result.