PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

question about ernie 1.0 pretrain data creation #2634

Closed BladeSun closed 2 years ago

BladeSun commented 2 years ago

https://github.com/PaddlePaddle/PaddleNLP/blob/0315365dbafa6e3b1c7147121ba85e05884125a5/model_zoo/ernie-1.0/data_tools/create_pretraining_data.py#L268

In this line, a Chinese char prefixed with "##" will be mapped to UNK.
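A minimal sketch (a toy lookup, not PaddleNLP's actual tokenizer code) of why this happens: a WordPiece-style lookup keeps the "##" prefix on continuation pieces, so if the vocab has `中` but no `##中` entry, the piece degrades to `[UNK]`.

```python
# Toy vocab without "##"-prefixed Chinese entries, like the ERNIE 3.0 vocab.
vocab = {"[UNK]": 0, "中": 1, "国": 2}

def lookup(token, vocab):
    # Continuation pieces keep their "##" prefix during the id lookup.
    return vocab.get(token, vocab["[UNK]"])

print(lookup("中", vocab))    # 1 -- the plain char is in the vocab
print(lookup("##中", vocab))  # 0 -- "##中" is missing, falls back to [UNK]
```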

ZHUI commented 2 years ago

It is a problem. The vocab.txt of ERNIE 1.0 contains Chinese chars prefixed with ##, but the vocab of ERNIE 3.0 does not.

In ERNIE pre-training, the ##-prefixed Chinese chars are never used in the word embedding. For example, ##中 should be treated as 中. But for whole word masking (WWM) of Chinese, we still need the ##-prefixed chars to carry the word segmentation information, e.g. ##国, so that the whole word can be masked.
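To make the role of the "##" marker concrete, here is a hedged sketch of whole word masking (an illustrative implementation, not the one in `dataset_utils.py`): pieces are grouped into words using the "##" continuation prefix, and a word is masked all-or-nothing.

```python
import random

def whole_word_mask(pieces, mask_prob=0.15, seed=0):
    """Mask whole words, where "##" marks a continuation piece (sketch only)."""
    rng = random.Random(seed)
    # Group piece indices into words: a "##" piece extends the current word.
    words, current = [], []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)
    # Mask each word as a unit, so 中 and ##国 are masked together.
    masked = list(pieces)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = "[MASK]"
    return masked
```

Without the "##" prefix (or equivalent segment info), 中 and 国 would be indistinguishable from two single-char words and could be masked independently.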

There are two ways to solve this problem:

  1. Use additional data to store the word segmentation information.
  2. Temporarily add ##[\u4E00-\u9FA5] entries to your vocab. Be careful: the random replacement word should not be a ##-prefixed Chinese char.

https://github.com/PaddlePaddle/PaddleNLP/blob/0315365dbafa6e3b1c7147121ba85e05884125a5/model_zoo/ernie-1.0/data_tools/dataset_utils.py#L477-L478
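The second way can be sketched roughly like this (a hypothetical helper, not PaddleNLP API): append a "##"-prefixed entry for every CJK char in U+4E00..U+9FA5 so the segmentation marker survives tokenization.

```python
def extend_vocab_with_cjk_continuations(vocab):
    """Add ##-prefixed entries for CJK chars U+4E00..U+9FA5 (sketch only)."""
    next_id = max(vocab.values()) + 1
    for cp in range(0x4E00, 0x9FA5 + 1):
        piece = "##" + chr(cp)
        if piece not in vocab:
            vocab[piece] = next_id
            next_id += 1
    return vocab

vocab = extend_vocab_with_cjk_continuations({"[UNK]": 0, "中": 1})
# Note: when sampling a random replacement token during masking, these
# ##-prefixed entries must be excluded, as the comment above warns.
```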

We will support the second way in a few days.

BladeSun commented 2 years ago

Thanks for your reply! I have another question: why does this implementation prefer approach two instead of approach one, which might use the `segment_info` logic as in:

https://github.com/PaddlePaddle/ERNIE/blob/26a16918f3110437bcffb012fe1ac1480d3dbdd8/demo/pretrain/pretrain.py#L127

It seems more efficient to identify the boundary of a chunk using an index list.

ZHUI commented 2 years ago

Yes, the original Paddle/ERNIE uses the first approach. But it needs additional info to store the segment_info. Example:

```
美 丽 中 国  -> token_id
0  1  0  1  -> segment_info (additional storage)
```

If we use the extended vocab:

```
美 ##丽 中 ##国  -> token_id & segment_info
```

However, the first approach will be easier to use. We will support it.
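The equivalence of the two representations can be sketched as follows (an illustrative helper, not part of PaddleNLP): stripping the "##" prefix from the extended-vocab encoding recovers both the plain tokens and the segment_info list of the first approach.

```python
def split_token_and_segment(pieces):
    """Recover (tokens, segment_info) from "##"-marked pieces (sketch only)."""
    tokens, segment_info = [], []
    for piece in pieces:
        if piece.startswith("##"):
            tokens.append(piece[2:])
            segment_info.append(1)  # continuation of the previous word
        else:
            tokens.append(piece)
            segment_info.append(0)  # start of a word
    return tokens, segment_info

tokens, seg = split_token_and_segment(["美", "##丽", "中", "##国"])
# tokens == ["美", "丽", "中", "国"], seg == [0, 1, 0, 1]
```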

ZHUI commented 2 years ago

Here is the solution for the second approach, FYI:

https://github.com/PaddlePaddle/PaddleNLP/pull/2667

https://github.com/PaddlePaddle/PaddleNLP/blob/70c41894086011f50d7dccc5885860dd4317da10/paddlenlp/transformers/ernie/tokenizer.py#L239-L279