ChiuHsin opened this issue 5 years ago
Hi, there are some special tokens in the vocabulary (for example, BOS stands for Beginning Of Sentence), and we can put them either at the beginning of the lookup table (embedding) or at the end. I decided to put them at the beginning. As for the UNUSED_COUNT, you can check the vocab files of the pretrained BERT models.
Ah, you might be confused about their usage, right? Say you want to feed a sentence into your network: you have to add the BOS and EOS tokens to your sentence, so you need to know where they live in the embedding table.
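As a rough illustration (a minimal sketch only: placing the special rows at vocab_size + OFFSET is assumed from the load_google_bert line quoted below, and wrap_with_specials is a hypothetical helper, not part of BERT-keras):

```python
# Hypothetical sketch: wrap a tokenized sentence with BOS/EOS ids so the
# model can look up their rows in the embedding table. Placing the special
# rows at vocab_size + OFFSET is an assumption based on the indexing quoted
# later in this thread.
BOS_OFFSET = 2
EOS_OFFSET = 4

def wrap_with_specials(token_ids, vocab_size):
    bos_id = vocab_size + BOS_OFFSET  # assumed row of the BOS embedding
    eos_id = vocab_size + EOS_OFFSET  # assumed row of the EOS embedding
    return [bos_id] + token_ids + [eos_id]

# e.g. wrap_with_specials([10, 42, 7], vocab_size=30000)
# -> [30002, 10, 42, 7, 30004]
```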
I see, but when I load a Google BERT model with load_google_bert, the vocab size is computed as vocab_size = vocab_size - TextEncoder.BERT_SPECIAL_COUNT - TextEncoder.BERT_UNUSED_COUNT, and the indices don't match when w_id == 2: the line 'weights[w_id][vocab_size + TextEncoder.EOS_OFFSET] = saved[3 + TextEncoder.BERT_UNUSED_COUNT]' cannot load the weight.
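(For context, in Google's standard uncased vocab.txt the rows are laid out as [PAD] = 0, [unused0]..[unused98] = 1-99, [UNK] = 100, [CLS] = 101, [SEP] = 102, [MASK] = 103, so saved[3 + BERT_UNUSED_COUNT] is the [SEP] row. The mapping below is a sketch inferred from those constants, not the repository's actual loader code.)

```python
# Assumed layout of the relevant rows in Google's uncased vocab.txt and
# which TextEncoder special each one would feed (sketch only, not the
# actual load_google_bert code).
BERT_UNUSED_COUNT = 99  # [unused0] .. [unused98] occupy rows 1..99

google_rows = {
    0: "[PAD]",                       # -> PAD (assumed)
    2 + BERT_UNUSED_COUNT: "[CLS]",   # row 101 -> BOS (assumed)
    3 + BERT_UNUSED_COUNT: "[SEP]",   # row 102 -> EOS (the line quoted above)
    4 + BERT_UNUSED_COUNT: "[MASK]",  # row 103 -> MSK (assumed)
}
```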
@ChiuHsin I guess you are right, and it seems that you were able to solve it (based on the other issue you posted). Can you please send a pull request to fix this problem? Thanks!
When I use BERT-keras, I don't understand this part:
```python
class TextEncoder:
    PAD_OFFSET = 0
    MSK_OFFSET = 1
    BOS_OFFSET = 2
    DEL_OFFSET = 3  # delimiter
    EOS_OFFSET = 4
    SPECIAL_COUNT = 5
    NUM_SEGMENTS = 2
    BERT_UNUSED_COUNT = 99  # bert pretrained models
    BERT_SPECIAL_COUNT = 4  # they don't have DEL
```
Why is it set up like this? And what are BERT_UNUSED_COUNT = 99 and BERT_SPECIAL_COUNT = 4 used for in load_google_bert?
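(As a rough sketch of how those two constants enter load_google_bert, based only on the formula quoted earlier in this thread; google_vocab_size = 30522 is the uncased BERT-Base vocab and is used here purely for illustration.)

```python
# Sketch only: drop the first block of [unusedN] rows and the 4 BERT special
# rows from the checkpoint's vocab count, per the formula quoted above.
BERT_UNUSED_COUNT = 99   # first block of [unusedN] rows in Google's vocab.txt
BERT_SPECIAL_COUNT = 4   # BERT has 4 of the 5 TextEncoder specials (no DEL)

google_vocab_size = 30522  # uncased BERT-Base, for illustration
vocab_size = google_vocab_size - BERT_SPECIAL_COUNT - BERT_UNUSED_COUNT
print(vocab_size)  # 30419; the TextEncoder specials are then addressed as
                   # vocab_size + OFFSET (e.g. vocab_size + EOS_OFFSET above)
```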