cl-tohoku / bert-japanese

BERT models for Japanese text.
Apache License 2.0

[Question] About the Char model #28

Closed · AprilSongRits closed this issue 3 years ago

AprilSongRits commented 3 years ago

Hi, thank you for sharing this project. I would like to ask about the reason for the MeCab tokenization in the Char model. Is there any difference between "directly splitting into characters" and "first applying MeCab tokenization and then splitting into characters"?

singletongue commented 3 years ago

Hi, @AprilSongRits.

The reason for the MeCab tokenization in the char model is that we wanted to apply whole word masking when pretraining the model. Since we used Google's original pretraining script, applying pre-tokenization was an easy way to enable whole word masking (there might be more efficient ways to implement this, though).
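As a rough illustration (not the repository's actual pretraining code), here is a minimal sketch of why pre-tokenization makes whole word masking straightforward: once the text is split into words first, every character piece belonging to the same word can be masked as a unit. The function name `whole_word_mask` and the masking rate are placeholders for this example.

```python
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]"):
    """Split pre-tokenized words into characters, then mask whole words at once."""
    tokens, word_ids = [], []
    for i, word in enumerate(words):
        for ch in word:
            tokens.append(ch)
            word_ids.append(i)  # remember which word each character came from
    # choose word indices to mask; every character of those words gets masked
    n_to_mask = max(1, int(len(words) * mask_prob))
    masked_word_ids = set(random.sample(range(len(words)), n_to_mask))
    return [mask_token if wid in masked_word_ids else tok
            for tok, wid in zip(tokens, word_ids)]

# e.g. ['吾', '輩', 'は', '[MASK]', 'で', 'あ', 'る'] (the chosen word varies per run)
print(whole_word_mask(["吾輩", "は", "猫", "で", "ある"]))
```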

There will be some differences between the results of (1) direct character splitting and (2) MeCab pre-tokenization followed by character splitting. For instance, when the input text contains whitespace, method (1) will preserve the whitespace characters as split characters, whereas method (2) will drop them in the process of the MeCab tokenization.
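For example, the following sketch contrasts the two approaches on a string containing a space, assuming a MeCab wrapper such as fugashi and a dictionary such as unidic-lite are installed (these are assumptions for the example, not a statement about the repository's exact setup). The exact word segmentation depends on the dictionary, but the whitespace handling is the point here.

```python
import fugashi  # MeCab wrapper; requires a dictionary such as unidic-lite

text = "吾輩は 猫である"  # note the whitespace in the middle

# (1) direct character splitting: the whitespace survives as its own character
chars_direct = list(text)

# (2) MeCab pre-tokenization, then character splitting: the whitespace is dropped
tagger = fugashi.Tagger()
words = [w.surface for w in tagger(text)]
chars_via_mecab = [ch for w in words for ch in w]

print(chars_direct)     # ['吾', '輩', 'は', ' ', '猫', 'で', 'あ', 'る']
print(chars_via_mecab)  # ['吾', '輩', 'は', '猫', 'で', 'あ', 'る']
```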

AprilSongRits commented 3 years ago

Thank you for your clear reply!