Closed — BladeSun closed this issue 2 years ago
It is a problem. The vocab.txt of ernie-1.0 contains Chinese chars that start with ##, but the vocab of ernie-3.0 has no such ## Chinese chars.
In the ERNIE pre-training process, the ## Chinese chars are never used in the word embedding. For example, ##中 should be treated as 中. But for whole word masking (WWM) of Chinese text, we still need the additional ## Chinese chars to carry the word segmentation information, e.g. 中 ##国, so that whole words can be masked together.
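The role of the ## markers in WWM can be sketched as follows. This is an illustrative Python sketch, not PaddleNLP's actual implementation; the function names and the masking probability are assumptions:

```python
import random

def group_whole_words(tokens):
    """Group WordPiece-style tokens into whole words: a token
    starting with '##' continues the previous word."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1].append(tok)
        else:
            words.append([tok])
    return words

def whole_word_mask(tokens, mask_token="[MASK]", p=0.15, seed=0):
    """Mask all subword pieces of a word together with probability p
    (illustrative; real WWM also does random/keep replacement)."""
    rng = random.Random(seed)
    out = []
    for word in group_whole_words(tokens):
        if rng.random() < p:
            out.extend([mask_token] * len(word))
        else:
            out.extend(word)
    return out

print(group_whole_words(["美", "##丽", "中", "##国"]))
# [['美', '##丽'], ['中', '##国']]
```

Without the ## markers, 中 and 国 would look like two independent single-char words, and WWM could mask 国 while leaving 中 visible.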
There are two ways to solve this problem: keep the vocab unchanged and carry the word-boundary information separately, or add ##[\u4E00-\u9FA5] entries to your vocab. Be careful: the randomly replaced word should not be a ## Chinese char. We will support the second way within days.
Thanks for your reply! I have another question. Why does this implementation prefer approach two instead of approach one, which might use the "segment_info" logic as in:
It seems more efficient to identify the boundary of a chunk using an index list.
Yes, the original Paddle/ERNIE follows the first approach, but it needs additional info to carry the segment_info. Example:

美 丽 中 国 -> token_id
0 1 0 1 -> segment_info (additional storage)

If we use the extended vocab, one sequence carries both:

美 ##丽 中 ##国 -> token_id & segment_info

However, the first approach will be easier to use. We will support it.
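The two representations above are equivalent and easy to convert between. A minimal sketch (the function name is an assumption, not PaddleNLP API), mapping extended-vocab tokens to plain tokens plus a segment_info list where 0 marks a word start and 1 a continuation:

```python
def split_markers(tokens):
    """Convert extended-vocab tokens (with '##' continuation
    markers) into plain tokens plus a segment_info list:
    0 = start of word, 1 = continuation of previous word."""
    plain, segment_info = [], []
    for tok in tokens:
        if tok.startswith("##"):
            plain.append(tok[2:])   # strip the '##' marker
            segment_info.append(1)
        else:
            plain.append(tok)
            segment_info.append(0)
    return plain, segment_info

print(split_markers(["美", "##丽", "中", "##国"]))
# (['美', '丽', '中', '国'], [0, 1, 0, 1])
```

This reproduces the 0 1 0 1 segment_info from the example above, which is why the extended vocab needs no additional storage.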
Here is the solution for the second approach, FYI:
https://github.com/PaddlePaddle/PaddleNLP/blob/0315365dbafa6e3b1c7147121ba85e05884125a5/model_zoo/ernie-1.0/data_tools/create_pretraining_data.py#L268
In this line, Chinese chars starting with "##" will be mapped to UNK.
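The behavior described for that line can be sketched like this. This is a simplified illustration, assuming a plain dict vocab rather than the actual PaddleNLP tokenizer: ##-prefixed Chinese tokens are absent from the non-extended vocab, so lookup falls back to the UNK id.

```python
def tokens_to_ids(tokens, vocab, unk_id):
    """Look up token ids; tokens absent from the (non-extended)
    vocab, such as '##中', fall back to unk_id."""
    return [vocab.get(tok, unk_id) for tok in tokens]

vocab = {"美": 0, "丽": 1, "中": 2, "国": 3}
print(tokens_to_ids(["美", "##丽", "中", "##国"], vocab, unk_id=100))
# [0, 100, 2, 100]
```

This is why the ## markers must be stripped (or the vocab extended) before id lookup, even though they are still needed earlier in the pipeline to drive whole word masking.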